-
Notifications
You must be signed in to change notification settings - Fork 684
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable 0 workers replicaSpec for pytorchjob #6201
base: master
Are you sure you want to change the base?
Enable 0 workers replicaSpec for pytorchjob #6201
Conversation
Thank you for opening this pull request! 🙌 These tips will help get your PR across the finish line:
|
Code Review Agent Run #29781fActionable Suggestions - 1
Review Details
|
Changelist by BitoThis pull request implements the following key changes.
|
flyteplugins/go/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go
Outdated
Show resolved
Hide resolved
Code Review Agent Run #c188b6Actionable Suggestions - 0Review Details
|
@@ -145,16 +145,17 @@ func (p pytorchOperatorResourceHandler) BuildResource(ctx context.Context, taskC | |||
"Invalid TaskSpecification, unsupported task template version [%v] key", taskTemplate.GetTaskTypeVersion()) | |||
} | |||
|
|||
if *workerReplicaSpec.Replicas <= 0 { | |||
return nil, fmt.Errorf("number of workers must be greater than 0") | |||
jobSpec := kubeflowv1.PyTorchJobSpec{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a small unit test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @pingsutw, I've removed TestBuildResourcePytorchV1WithZeroWorker
as it was intended to fail in the case corrected with this PR. Replaced it with the TestBuildResourcePytorchV1WithDifferentWorkersNumber
test, thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One unit test is failing. otherwise, lgtm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fixed now, thanks.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #6201 +/- ##
==========================================
- Coverage 37.08% 37.07% -0.01%
==========================================
Files 1318 1318
Lines 132707 132711 +4
==========================================
+ Hits 49208 49209 +1
- Misses 79244 79247 +3
Partials 4255 4255
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
a683ad3
to
33ab59b
Compare
33ab59b
to
7464db6
Compare
Code Review Agent Run #59b7b8Actionable Suggestions - 1
Review Details
|
jobSpec := kubeflowv1.PyTorchJobSpec{} | ||
replicaSpecs := map[commonOp.ReplicaType]*commonOp.ReplicaSpec{ | ||
kubeflowv1.PyTorchJobReplicaTypeMaster: masterReplicaSpec, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider validating that masterReplicaSpec
is not nil before using it in the replicaSpecs
map. A nil masterReplicaSpec
could cause runtime issues.
Code suggestion
Check the AI-generated fix before applying
jobSpec := kubeflowv1.PyTorchJobSpec{} | |
replicaSpecs := map[commonOp.ReplicaType]*commonOp.ReplicaSpec{ | |
kubeflowv1.PyTorchJobReplicaTypeMaster: masterReplicaSpec, | |
} | |
jobSpec := kubeflowv1.PyTorchJobSpec{} | |
if masterReplicaSpec == nil { | |
return nil, fmt.Errorf("master replica spec cannot be nil") | |
} | |
replicaSpecs := map[commonOp.ReplicaType]*commonOp.ReplicaSpec{ | |
kubeflowv1.PyTorchJobReplicaTypeMaster: masterReplicaSpec, | |
} |
Code Review Run #59b7b8
Is this a valid issue, or was it incorrectly flagged by the Agent?
- it was incorrectly flagged
Signed-off-by: punker <[email protected]>
Co-authored-by: Flyte Bot <[email protected]> Signed-off-by: Gleb <[email protected]> Signed-off-by: punker <[email protected]>
Signed-off-by: punker <[email protected]>
Signed-off-by: punker <[email protected]>
Signed-off-by: punker <[email protected]>
a71e27b
to
644ce84
Compare
Code Review Agent Run #216032Actionable Suggestions - 0Review Details
|
Why are the changes needed?
Currently, PyTorchJob allows not to specify Worker in case only one worker (Master) is needed for the job. Current functionality of the plugin doesn't allow to do so, however, in some cases we still want to achieve a Master-only PyTorchJob.
What changes were proposed in this pull request?
Made a change so
pytorch
plugin is no longer complains about a 0-replica worker, but instead - if replicas=0 is passed, plugin just doesn't render it.I understand this isn't a best design for the moment, but that is a least-invasive approach to achieve it without changing e.g
pyflyte
validation behavior (if e.g we accept that worker=None could be passed to the PyTorchtask_config
.How was this patch tested?
We are running the fork code in our production environment
Check all the applicable boxes
Summary by Bito
Enhanced PyTorch operator plugin to support master-only configurations by removing the requirement for worker replicas. Modified replica specification logic to make worker specs optional. Updated test suite to validate both zero-worker and single-worker configurations.Unit tests added: True
Estimated effort to review (1-5, lower is better): 2