Provide support for TPU multislice in JobSet #787

GiuseppeTT · 2025-02-19T17:44:28Z

What would you like to be added:

Add native support for TPU multislice to JobSet. Specifically, set the megascale env variables MEGASCALE_NUM_SLICES, MEGASCALE_SLICE_ID and MEGASCALE_COORDINATOR_ADDRESS automatically.

Why is this needed:

A common pattern in workloads running on TPU nodes involves setting the megascale env variables MEGASCALE_NUM_SLICES, MEGASCALE_SLICE_ID and MEGASCALE_COORDINATOR_ADDRESS. This is manually done like so:

env:
- name: MEGASCALE_NUM_SLICES
  valueFrom:
    fieldRef:
      fieldPath: "metadata.annotations['jobset.sigs.k8s.io/global-replicas']"
- name: MEGASCALE_SLICE_ID
  valueFrom:
    fieldRef:
      fieldPath: "metadata.annotations['jobset.sigs.k8s.io/job-global-index']"
- name: MEGASCALE_COORDINATOR_ADDRESS
  valueFrom:
    fieldRef:
      fieldPath: "metadata.annotations['jobset.sigs.k8s.io/coordinator']"

Setting these env variables automatically in JobSet would improve the user experience and make it easier to run TPU workloads.
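
To make the request concrete, here is a minimal sketch of what the automatic injection could look like. It is written in Python for brevity and works on a plain dict rather than real Kubernetes types. The helper name `inject_megascale_env` is hypothetical and not part of JobSet; the annotation keys are the ones from the manual example above.

```python
import copy

# Env variable name -> JobSet annotation it is read from (taken from the manual example).
MEGASCALE_ENV_SOURCES = {
    "MEGASCALE_NUM_SLICES": "jobset.sigs.k8s.io/global-replicas",
    "MEGASCALE_SLICE_ID": "jobset.sigs.k8s.io/job-global-index",
    "MEGASCALE_COORDINATOR_ADDRESS": "jobset.sigs.k8s.io/coordinator",
}


def inject_megascale_env(container):
    """Return a copy of a container spec (plain dict) with the megascale env variables added.

    Hypothetical helper for illustration only. Variables the user already
    set are left untouched so that explicit values win over the defaults.
    """
    result = copy.deepcopy(container)
    env = result.setdefault("env", [])
    existing = {entry["name"] for entry in env}
    for name, annotation in MEGASCALE_ENV_SOURCES.items():
        if name in existing:
            continue
        env.append({
            "name": name,
            "valueFrom": {
                "fieldRef": {"fieldPath": f"metadata.annotations['{annotation}']"}
            },
        })
    return result


container = {"name": "main", "image": "example-image"}  # placeholder values
patched = inject_megascale_env(container)
```

Skipping variables that are already present is one possible design choice; it would let users override individual values without disabling the feature entirely.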

On top of that, it would be a good opportunity to extend how JobSet counts replicas. For instance, JobSet currently adds the annotation jobset.sigs.k8s.io/global-replicas, which is a good candidate for the value of MEGASCALE_NUM_SLICES. However, in many cases only a subset of those replicas (the ones that use TPUs) should count toward MEGASCALE_NUM_SLICES. Introducing a new API to group replicas, together with a new annotation such as jobset.sigs.k8s.io/group-replicas, would support not only the TPU use case but also other use cases that require replica grouping.
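
As a rough illustration of the grouping idea, the sketch below computes per-group counts next to the existing global ones. Everything here is an assumption made for the example: the `(name, replicas, group)` input shape, the function name, and the `job-group-index` annotation are hypothetical, and `group-replicas` is only the annotation proposed above, not something JobSet provides today.

```python
from collections import defaultdict

PREFIX = "jobset.sigs.k8s.io/"


def group_replica_annotations(replicated_jobs):
    """Compute global and per-group replica annotations for each child job.

    replicated_jobs: list of (name, replicas, group) tuples, in JobSet order.
    Returns a dict keyed by "<name>-<index>". Hypothetical sketch only.
    """
    total = sum(replicas for _, replicas, _ in replicated_jobs)
    group_totals = defaultdict(int)
    for _, replicas, group in replicated_jobs:
        group_totals[group] += replicas

    annotations = {}
    global_index = 0
    next_group_index = defaultdict(int)
    for name, replicas, group in replicated_jobs:
        for i in range(replicas):
            annotations[f"{name}-{i}"] = {
                PREFIX + "global-replicas": total,
                PREFIX + "job-global-index": global_index,
                PREFIX + "group-replicas": group_totals[group],        # proposed above
                PREFIX + "job-group-index": next_group_index[group],   # hypothetical
            }
            global_index += 1
            next_group_index[group] += 1
    return annotations


# One CPU-only driver job plus two TPU slices: only the two slices
# should count toward MEGASCALE_NUM_SLICES.
result = group_replica_annotations([("driver", 1, "cpu"), ("slice", 2, "tpu")])
```

With this input, the TPU jobs would see a group size of 2 while the global replica count is 3, which is the mismatch described above.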

Note: A good precedent for adding TPU-specific code is LeaderWorkerSet: https://github.com/kubernetes-sigs/lws


GiuseppeTT · 2025-02-19T17:45:47Z

If approved, I have a good idea on how to implement it.

kannon92 · 2025-02-19T17:52:27Z

A PR is always welcome!

I'll leave it to @ahg-g to comment on this. I am not a TPU user, nor do I have access to TPUs.

andreyvelich · 2025-02-19T18:38:42Z

Thank you for creating this @GiuseppeTT, it looks great!

I’m curious about the Kubernetes Batch WG and JobSet maintainers' perspectives on introducing environment-specific variables (TPU, NPU, GPU) directly into JobSet. Do we have similar examples for Batch/Job?

Previously, we discussed designing TrainJob and TrainingRuntime APIs on top of JobSet to allow users to specify ML policies with framework-specific configurations (e.g., JAXPolicy, TorchPolicy, etc.) to improve UX.

Or do these env variables (MEGASCALE_NUM_SLICES, MEGASCALE_SLICE_ID, MEGASCALE_COORDINATOR_ADDRESS) apply to multiple ML frameworks that use TPUs?

cc @ahg-g @kannon92 @tenzen-y @astefanutti @danielvegamyhre
