-
Notifications
You must be signed in to change notification settings - Fork 738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Initialized and ComponentsCreated conditions to TrainJob API #2464
base: master
Are you sure you want to change the base?
Add Initialized and ComponentsCreated conditions to TrainJob API #2464
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@ tenzen-y |
// TrainJobCreated means that the actual jobs creation has succeeded. | ||
TrainJobCreated string = "Created" | ||
// TrainJobInitialized means the TrainJob has been initialized. | ||
TrainJobInitialized string = "Initialized" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for doing this @dineshkolhe1!
Let's discuss whether these conditions make sense given that we also have custom dataset and model Initializers in the TrainJob.
/hold
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
obviously ! it make sense
the condition status -
Created – The job has been submitted but not yet started. (its make difficult to understand)
but ,
thise condition bit easily understandable -
Initialized – Indicates that the TrainJob request has been accepted and validated. &
ComponentsCreated – Confirms that all necessary components have been successfully created. (e.g., pods, PVCs, ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current Created status may be too misleading.
Initialized and ComponentsCreated would offer better clarity on the state of the training job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only concern is that we also support custom Initializers to fetch data and pre-trained model before running distributed training: https://github.com/kubeflow/trainer/tree/master/pkg/initializer
@kubeflow/wg-training-leads @Electronic-Waste @astefanutti @seanlaii @saileshd1402 @akshaychitneni @shravan-achar Any thoughts on how to make TrainJob conditions less confusing for users ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about instead of "Initialized" and "ComponentsCreated".
We use "Created" and "Initialized"/"Active"?
I think this is more clear to users
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My recommendation is to replace Created
with ComponentsCreated
. For Active
, this really should not be introduced since the active state is not a stable state.
This PR adds
Initialized
andComponentsCreated
conditions to theTrainJob
API to better track the state of job creation and initialization.Initialized
condition to indicate when the TrainJob has been initialized.ComponentsCreated
condition to indicate when the components (e.g., JobSet, Jobs) have been created.