Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Initialized and ComponentsCreated conditions to TrainJob API #2464

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

dineshkolhe1
Copy link

This PR adds Initialized and ComponentsCreated conditions to the TrainJob API to better track the state of job creation and initialization.

  • Added Initialized condition to indicate when the TrainJob has been initialized.
  • Added ComponentsCreated condition to indicate when the components (e.g., JobSet, Jobs) have been created.
  • Updated the controller logic to set these conditions during reconciliation.

Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot requested a review from jinchihe March 1, 2025 16:57
@dineshkolhe1
Copy link
Author

@ tenzen-y
can you check my PR where I added some feature .

// TrainJobCreated means that the actual jobs creation has succeeded.
TrainJobCreated string = "Created"
// TrainJobInitialized means the TrainJob has been initialized.
TrainJobInitialized string = "Initialized"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this @dineshkolhe1!
Let's discuss whether these conditions make sense given that we also have custom dataset and model Initializers in the TrainJob.
/hold

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

obviously ! it make sense
the condition status -
Created – The job has been submitted but not yet started. (its make difficult to understand)
but ,
thise condition bit easily understandable -
Initialized – Indicates that the TrainJob request has been accepted and validated. &
ComponentsCreated – Confirms that all necessary components have been successfully created. (e.g., pods, PVCs, ).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current Created status may be too misleading.
Initialized and ComponentsCreated would offer better clarity on the state of the training job.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern is that we also support custom Initializers to fetch data and pre-trained model before running distributed training: https://github.com/kubeflow/trainer/tree/master/pkg/initializer

@kubeflow/wg-training-leads @Electronic-Waste @astefanutti @seanlaii @saileshd1402 @akshaychitneni @shravan-achar Any thoughts on how to make TrainJob conditions less confusing for users ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about instead of "Initialized" and "ComponentsCreated".
We use "Created" and "Initialized"/"Active"?

I think this is more clear to users

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My recommendation is to replace Created with ComponentsCreated. For Active, this really should not be introduced since the active state is not a stable state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants