Maybe fork a separate process to monitor execution #13

Open
dlwh opened this issue Feb 26, 2025 · 0 comments

dlwh commented Feb 26, 2025

A main use case for Ray-on-TPU is orchestrating JAX training jobs. When something goes wrong, XLA tends to abort the process, which terminates it outside of Python control flow. As a result, Ray can't tell whether the worker process crashed because of an internal Ray error or because XLA killed it. To help with retry logic, we have found it useful to fork a separate process to run the main payload function.

We're happy to keep this in our library if you don't think it's necessary. Just raising it for now.

Our implementation and justification: https://github.com/stanford-crfm/levanter/blob/94afdc17f6091249e70e90cb27b4378d0553ff56/src/levanter/infra/ray_tpu.py#L459
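
For readers who want the general shape of the idea without reading the linked file, here is a minimal sketch using only Python's standard `multiprocessing` module. It is not the Levanter implementation. The names `run_in_fork`, `PayloadCrashed`, and `PayloadFailed` are hypothetical and chosen for illustration. It assumes a platform where the `fork` start method is available (e.g. Linux), and it uses `os.abort()` as a stand-in for an XLA abort.

```python
import multiprocessing as mp
import os
import signal


class PayloadFailed(RuntimeError):
    """The payload raised an ordinary Python exception in the child."""


class PayloadCrashed(RuntimeError):
    """The child died without reporting back (e.g. abort() or a signal)."""

    def __init__(self, exitcode):
        super().__init__(f"payload process died with exit code {exitcode}")
        self.exitcode = exitcode


def _child_main(conn, fn, args):
    # Runs in the forked child: report either the result or the exception.
    try:
        conn.send(("ok", fn(*args)))
    except BaseException as exc:  # report anything catchable from Python
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_in_fork(fn, *args):
    """Run fn(*args) in a forked child so a hard abort cannot kill the caller."""
    ctx = mp.get_context("fork")
    recv_conn, send_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child_main, args=(send_conn, fn, args))
    proc.start()
    # Close the parent's copy of the write end so recv() sees EOF
    # if the child dies before sending anything.
    send_conn.close()
    try:
        status, value = recv_conn.recv()
    except EOFError:
        status, value = "crashed", None
    proc.join()
    recv_conn.close()

    if status == "ok":
        return value
    if status == "error":
        raise PayloadFailed(value)
    # Negative exit codes mean the child was killed by that signal number.
    raise PayloadCrashed(proc.exitcode)
```

A caller can then treat the two failure modes differently, for example retrying on `PayloadCrashed` while surfacing `PayloadFailed` as a genuine bug:

```python
try:
    run_in_fork(os.abort)  # stand-in for a payload that XLA aborts
except PayloadCrashed as exc:
    if exc.exitcode == -signal.SIGABRT:
        print("payload was aborted; the parent process is still alive")
```

One caveat with this sketch: forking a process that already has running threads or an initialized accelerator runtime is generally unsafe, so the payload should probably do its own JAX initialization inside the child.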
