Provide support for TPU multislice in JobSet #787

Open
GiuseppeTT opened this issue Feb 19, 2025 · 3 comments

@GiuseppeTT
Contributor

What would you like to be added:

Add native support for TPU multislice to JobSet. Specifically, set the megascale env variables MEGASCALE_NUM_SLICES, MEGASCALE_SLICE_ID and MEGASCALE_COORDINATOR_ADDRESS automatically.

Why is this needed:

A common pattern in workloads running on TPU nodes involves setting the megascale env variables MEGASCALE_NUM_SLICES, MEGASCALE_SLICE_ID and MEGASCALE_COORDINATOR_ADDRESS. Today this is done manually in the container spec, like so:

env:
- name: MEGASCALE_NUM_SLICES
  valueFrom:
    fieldRef:
      fieldPath: "metadata.annotations['jobset.sigs.k8s.io/global-replicas']"
- name: MEGASCALE_SLICE_ID
  valueFrom:
    fieldRef:
      fieldPath: "metadata.annotations['jobset.sigs.k8s.io/job-global-index']"
- name: MEGASCALE_COORDINATOR_ADDRESS
  valueFrom:
    fieldRef:
      fieldPath: "metadata.annotations['jobset.sigs.k8s.io/coordinator']"

Setting these env variables automatically in JobSet would improve the user experience and make it easier to run TPU workloads.
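
For context, here is a minimal sketch of how a workload container might consume these variables once they are set. The `MegascaleConfig` class and `read_megascale_config` helper are hypothetical names used only for illustration; the range check on the slice ID assumes that `jobset.sigs.k8s.io/job-global-index` is zero-based, and it is not part of any TPU runtime API.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MegascaleConfig:
    """Hypothetical container for the three megascale env variables."""
    num_slices: int
    slice_id: int
    coordinator_address: str


def read_megascale_config(env=None):
    """Parse the megascale env variables from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    num_slices = int(env["MEGASCALE_NUM_SLICES"])
    slice_id = int(env["MEGASCALE_SLICE_ID"])
    coordinator_address = env["MEGASCALE_COORDINATOR_ADDRESS"]
    # Assumption: slice IDs are zero-based, so valid IDs are 0..num_slices-1.
    if not 0 <= slice_id < num_slices:
        raise ValueError(
            f"slice id {slice_id} is out of range for {num_slices} slices"
        )
    return MegascaleConfig(num_slices, slice_id, coordinator_address)


# Example with a made-up environment; the coordinator value is a placeholder.
example_env = {
    "MEGASCALE_NUM_SLICES": "4",
    "MEGASCALE_SLICE_ID": "2",
    "MEGASCALE_COORDINATOR_ADDRESS": "coordinator.example.internal",
}
config = read_megascale_config(example_env)
```

If any of the three variables is missing, the lookup raises `KeyError`, which is the failure a user hits today when the manual `env` block is forgotten.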

On top of that, it would be a good opportunity to extend how JobSet counts replicas. For instance, JobSet currently adds the annotation jobset.sigs.k8s.io/global-replicas, which is a good candidate for the value of MEGASCALE_NUM_SLICES. But in many cases only a subset of the replicas counted by jobset.sigs.k8s.io/global-replicas (the replicas that actually use TPUs) should count toward MEGASCALE_NUM_SLICES. Introducing a new API to group replicas and a new annotation like jobset.sigs.k8s.io/group-replicas would support not only the TPU use case but also other use cases that require replica grouping.
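
To make the difference between the two counts concrete, here is a rough sketch under stated assumptions. The `group` field on each replicated job is hypothetical (no such JobSet API exists today), and `replica_counts` is an illustrative helper, not JobSet code. It assumes the global count is the sum of replicas across all replicated jobs, while a per-group count sums only the replicas in that group.

```python
from collections import Counter


def replica_counts(replicated_jobs):
    """Return (global_replicas, replicas_per_group) for a simplified JobSet.

    Each entry is a dict with "name", "replicas", and a hypothetical
    "group" key; entries without a group fall into "default".
    """
    global_replicas = sum(rj["replicas"] for rj in replicated_jobs)
    per_group = Counter()
    for rj in replicated_jobs:
        per_group[rj.get("group", "default")] += rj["replicas"]
    return global_replicas, dict(per_group)


# Made-up example: four TPU slices plus one CPU-only driver job.
jobs = [
    {"name": "tpu-workers", "replicas": 4, "group": "tpu"},
    {"name": "cpu-driver", "replicas": 1, "group": "cpu"},
]
global_replicas, per_group = replica_counts(jobs)
```

In this example the global count is 5, but only the 4 replicas in the `tpu` group are slices, so a group-level count would be the better source for MEGASCALE_NUM_SLICES.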

Note: A good precedent for adding TPU-specific code is LeaderWorkerSet: https://github.com/kubernetes-sigs/lws

@GiuseppeTT
Contributor Author

If approved, I have a good idea of how to implement it.

@kannon92
Contributor

A PR is always welcome!

I'll leave it to @ahg-g to comment on this. I am not a TPU user, nor do I have access to TPUs.

@andreyvelich
Member

Thank you for creating this @GiuseppeTT, it looks great!

I’m curious about the Kubernetes Batch WG and JobSet maintainers' perspectives on introducing environment-specific variables (TPU, NPU, GPU) directly into JobSet. Do we have similar examples for Batch/Job?

Previously, we discussed designing TrainJob and TrainingRuntime APIs on top of JobSet to allow users to specify ML policies with framework-specific configurations (e.g., JAXPolicy, TorchPolicy, etc.) to improve UX.

Or do these env variables (MEGASCALE_NUM_SLICES, MEGASCALE_SLICE_ID, MEGASCALE_COORDINATOR_ADDRESS) apply to multiple ML frameworks that use TPUs?

cc @ahg-g @kannon92 @tenzen-y @astefanutti @danielvegamyhre
