What would you like to be added:
Add native support for TPU multislice to JobSet. Specifically, set the megascale env variables MEGASCALE_NUM_SLICES, MEGASCALE_SLICE_ID and MEGASCALE_COORDINATOR_ADDRESS automatically.
Why is this needed:
A common pattern in workloads running on TPU nodes involves setting the megascale env variables MEGASCALE_NUM_SLICES, MEGASCALE_SLICE_ID and MEGASCALE_COORDINATOR_ADDRESS. This is manually done like so:
Setting these env variables automatically in JobSet would improve the user experience and make it easier to run TPU workloads.
On top of that, it would be a good opportunity to extend how JobSet counts replicas. For instance, JobSet currently adds the annotation jobset.sigs.k8s.io/global-replicas, which is a good candidate for the value of MEGASCALE_NUM_SLICES. But in many cases only a fraction of jobset.sigs.k8s.io/global-replicas (the number of replicas that use TPUs) should be used for MEGASCALE_NUM_SLICES. Introducing a new API to group replicas, together with a new annotation like jobset.sigs.k8s.io/group-replicas, would support not only the TPU use case but also other use cases that require replica grouping.
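To make the replica-grouping idea concrete, here is a minimal sketch of how the three values could be derived per group rather than from the global replica count. It is not JobSet code: `ReplicatedJob.group`, `megascale_env`, and the choice of coordinator (pod 0 of Job 0 in the group's first replicated job, addressed as `<jobset>-<replicatedjob>-0-0.<jobset>`) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicatedJob:
    name: str
    replicas: int
    group: str  # hypothetical grouping key, e.g. "tpu" or "cpu"


def megascale_env(jobset_name, replicated_jobs, group, job_index_in_group):
    """Return the megascale env variables for one Job inside `group`.

    Only replicas that belong to `group` are counted, which mirrors the
    proposed group-replicas annotation instead of global-replicas.
    """
    members = [rj for rj in replicated_jobs if rj.group == group]
    if not members:
        raise ValueError(f"no replicated job belongs to group {group!r}")

    num_slices = sum(rj.replicas for rj in members)
    if not 0 <= job_index_in_group < num_slices:
        raise ValueError("job index is outside the group")

    # Assumption: the coordinator is pod 0 of Job 0 of the group's first
    # replicated job, reachable through a headless service named after
    # the JobSet.
    coordinator = f"{jobset_name}-{members[0].name}-0-0.{jobset_name}"

    return {
        "MEGASCALE_NUM_SLICES": str(num_slices),
        "MEGASCALE_SLICE_ID": str(job_index_in_group),
        "MEGASCALE_COORDINATOR_ADDRESS": coordinator,
    }


# One CPU driver plus four TPU slices: the global replica count is 5,
# but only the 4 replicas in the "tpu" group count as slices.
jobs = [ReplicatedJob("driver", 1, "cpu"), ReplicatedJob("slice", 4, "tpu")]
env = megascale_env("train", jobs, group="tpu", job_index_in_group=2)
print(env)
```

In this example the global replica count would be 5, while the group count used for MEGASCALE_NUM_SLICES is 4, which is the mismatch the grouping API is meant to resolve.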
Thank you for creating this @GiuseppeTT, it looks great!
I’m curious about the Kubernetes Batch WG and JobSet maintainers' perspectives on introducing environment-specific variables (TPU, NPU, GPU) directly into JobSet. Do we have similar examples for Batch/Job?
Previously, we discussed designing TrainJob and TrainingRuntime APIs on top of JobSet to allow users to specify ML policies with framework-specific configurations (e.g., JAXPolicy, TorchPolicy, etc.) to improve UX.
Or do these env variables (MEGASCALE_NUM_SLICES, MEGASCALE_SLICE_ID, MEGASCALE_COORDINATOR_ADDRESS) apply to multiple ML frameworks that use TPUs?
Note: A good precedent for adding TPU-specific code is LeaderWorkerSet: https://github.com/kubernetes-sigs/lws