-
Notifications
You must be signed in to change notification settings - Fork 728
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2170: Use SSA to reconcile TrainJob components #2431
base: master
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold Requires kubernetes-sigs/jobset#782 |
@andreyvelich @tenzen-y @varshaprasad96 please take a look when you can. |
8e002f8
to
465e443
Compare
go.mod
Outdated
@@ -19,7 +19,7 @@ require ( | |||
sigs.k8s.io/controller-runtime v0.19.1 | |||
sigs.k8s.io/jobset v0.5.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you specify main branch, and then let us check if it should work fine as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I assumed you preferred to wait until the upcoming JobSet release :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current trainer development is under alpha stage. So, I don't care the library stability :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y @astefanutti Should we merge this PR once JobSet release v0.8.0 ?
I think, we are planning to release it this week: kubernetes-sigs/jobset#774
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've replaced the JobSet API version to v0.8.0-devel for now.
I'm working on adapting the unit tests but the main controller should work fine.
0e92a8c
to
16a000a
Compare
c9ff23c
to
4d7f02a
Compare
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this.
My first question is could we avoid unstructured and then keep using client.Object typed?
I was wondering if we can take a similar approach as https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L503 and https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L430
go 1.23.0 | ||
|
||
toolchain go1.23.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use these toolchains? Who is bringing this? JobSet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the replace
added to override the transitive dependencies coming from the newer JobSet version, go mod tidy
keeps adding this. I'd need to dig into it more to understand why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm, interesting. Actually, Go < 1.23.4, they enforce to specify toolchain. But after the Go 1.23.4, the restrictions are relaxed. So, IIUC, this technically could be removed.
Or, we may just bump to Kube 1.32 as library. @andreyvelich Do you want to avoid bumping kube lib to 1.32 here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, we should try to avoid toolchain here if that is possible.
Let's bump k8s to 1.32
Ultimately the I can look at it again, but from my experience and understanding, relying on apply configuration -> unstructured is the most reliable and canonical way do to SSA. |
Note I still need to update the newly introduced MPI plugin to rely on apply configurations. |
Thank you for checking that. If you think introducing the non unstructured objects with SSA has any potential "risks", please let me know. My understanding might be incorrect for SSA |
Yeah, absolutely |
What this PR does / why we need it:
Replace Get/Create/Update logic with SSA to reconcile TrainJob components.
I've successfully tested it with https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb.
Depends on kubernetes-sigs/jobset#782.
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Fixes #2297
Checklist: