Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: Use SSA to reconcile TrainJob components #2431

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

astefanutti
Copy link
Contributor

@astefanutti astefanutti commented Feb 11, 2025

What this PR does / why we need it:

Replace Get/Create/Update logic with SSA to reconcile TrainJob components.

I've successfully tested it with https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb.

Depends on kubernetes-sigs/jobset#782.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #2297

Checklist:

  • Docs included if any changes are user facing

Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@astefanutti
Copy link
Contributor Author

/hold

Requires kubernetes-sigs/jobset#782

@astefanutti
Copy link
Contributor Author

astefanutti commented Feb 11, 2025

@andreyvelich @tenzen-y @varshaprasad96 please take a look when you can.

@astefanutti astefanutti force-pushed the pr-ssa branch 2 times, most recently from 8e002f8 to 465e443 Compare February 11, 2025 13:35
go.mod Outdated
@@ -19,7 +19,7 @@ require (
sigs.k8s.io/controller-runtime v0.19.1
sigs.k8s.io/jobset v0.5.2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you specify main branch, and then let us check if it should work fine as expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I assumed you preferred to wait until the upcoming JobSet release :).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current trainer development is under alpha stage. So, I don't care the library stability :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y @astefanutti Should we merge this PR once JobSet release v0.8.0 ?
I think, we are planning to release it this week: kubernetes-sigs/jobset#774

cc @kannon92 @ahg-g

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've replaced the JobSet API version to v0.8.0-devel for now.

I'm working on adapting the unit tests but the main controller should work fine.

@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Feb 13, 2025
@astefanutti astefanutti force-pushed the pr-ssa branch 4 times, most recently from 0e92a8c to 16a000a Compare February 13, 2025 09:34
@astefanutti astefanutti force-pushed the pr-ssa branch 3 times, most recently from c9ff23c to 4d7f02a Compare February 14, 2025 17:21
Signed-off-by: Antonin Stefanutti <[email protected]>
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this.
My first question is could we avoid unstructured and then keep using client.Object typed?
I was wondering if we can take a similar approach as https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L503 and https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L430

Comment on lines +3 to +5
go 1.23.0

toolchain go1.23.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use these toolchains? Who is bringing this? JobSet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the replace added to override the transitive dependencies coming from the newer JobSet version, go mod tidy keeps adding this. I'd need to dig into it more to understand why.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm, interesting. Actually, Go < 1.23.4, they enforce to specify toolchain. But after the Go 1.23.4, the restrictions are relaxed. So, IIUC, this technically could be removed.

Or, we may just bump to Kube 1.32 as library. @andreyvelich Do you want to avoid bumping kube lib to 1.32 here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, we should try to avoid toolchain here if that is possible.
Let's bump k8s to 1.32

@astefanutti
Copy link
Contributor Author

Thanks for doing this. My first question is could we avoid unstructured and then keep using client.Object typed? I was wondering if we can take a similar approach as https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L503 and https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L430

Ultimately the client.Patch method converts the object into an unstructured object for SSA, because that's the only way to guarantee no default fields get introduced.

I can look at it again, but from my experience and understanding, relying on apply configuration -> unstructured is the most reliable and canonical way do to SSA.

@astefanutti
Copy link
Contributor Author

Note I still need to update the newly introduced MPI plugin to rely on apply configurations.

@tenzen-y
Copy link
Member

Thanks for doing this. My first question is could we avoid unstructured and then keep using client.Object typed? I was wondering if we can take a similar approach as https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L503 and https://github.com/kubernetes-sigs/kueue/blob/90ef60760b849e475f7cf32d07669bb91bbb479f/pkg/workload/workload.go#L430

Ultimately the client.Patch method converts the object into an unstructured object for SSA, because that's the only way to guarantee no default fields get introduced.

I can look at it again, but from my experience and understanding, relying on apply configuration -> unstructured is the most reliable and canonical way do to SSA.

Thank you for checking that. If you think introducing the non unstructured objects with SSA has any potential "risks", please let me know. My understanding might be incorrect for SSA

@tenzen-y
Copy link
Member

Note I still need to update the newly introduced MPI plugin to rely on apply configurations.

Yeah, absolutely

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KEP-2170: Replace UPSERT operation for the objects with SSA PATCH
3 participants