A main use case for Ray-on-TPU is orchestrating JAX training jobs. XLA tends to abort the process when things go wrong, which ends the process outside of Python control flow. As a result, Ray can't tell whether the Ray process crashed because of an internal Ray error or because XLA killed it. To help with retry logic, we have found it useful to fork a child process to run the main payload function.
We're happy to keep this in our library if you don't think it's necessary. Just raising it for now.
Our implementation and justification: https://github.com/stanford-crfm/levanter/blob/94afdc17f6091249e70e90cb27b4378d0553ff56/src/levanter/infra/ray_tpu.py#L459
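For illustration, here is a minimal sketch of the fork-and-run idea using only the Python standard library. It is not the Levanter code linked above and not a Ray API: `run_in_subprocess` and its return convention are hypothetical names chosen for this example, and it assumes a POSIX host where the `fork` start method is available.

```python
import multiprocessing as mp
import os
import signal


def run_in_subprocess(fn, *args):
    """Run fn(*args) in a forked child process (hypothetical helper).

    Returns ("ok", value) on success, ("error", repr(exc)) if fn raised a
    Python exception, or ("crashed", exitcode) if the child died without
    reporting anything, e.g. because native code called abort().
    """
    ctx = mp.get_context("fork")  # assumption: POSIX host
    recv, send = ctx.Pipe(duplex=False)

    def _target():
        try:
            send.send(("ok", fn(*args)))
        except Exception as exc:
            send.send(("error", repr(exc)))

    proc = ctx.Process(target=_target)
    proc.start()
    send.close()  # parent drops its write end so recv() sees EOF if the child dies

    try:
        result = recv.recv()
    except EOFError:
        result = None  # child exited without sending a result
    proc.join()

    if result is None:
        # A negative exitcode means the child was killed by that signal.
        return ("crashed", proc.exitcode)
    return result


if __name__ == "__main__":
    print(run_in_subprocess(lambda x: x + 1, 41))
    status, code = run_in_subprocess(os.abort)
    print(status, code == -signal.SIGABRT)
```

With this shape, the calling process survives an abort in the payload and can distinguish "payload was killed by a signal" from "payload raised a Python exception", which is the distinction retry logic needs.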