Multigpu training hangs using single and multiple nodes #8549
Comments
Hi, thanks for reporting this issue. Could you try
Hello, thanks for the response. I am afraid that
Could you try to change your code to
I only changed the shape of the Linear. Then it passed on my side. It seems the XLA compiler failed for the previous shape.
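For reference, the change under discussion only touches the nn.Linear dimensions in the example model; the exact shapes are cut off in the comments above, so the values below are placeholders:

```python
import torch.nn as nn

# Hypothetical illustration of the kind of change discussed: only the
# Linear layer's shape differs; the surrounding training loop is unchanged.
# model = nn.Linear(<original_in>, <original_out>)  # shape that failed to compile
model = nn.Linear(128, 10)  # placeholder for the shape that compiled and ran
```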
Thanks, now the script runs successfully on a single GPU, but it keeps hanging with more than one GPU (either on the same machine or on different machines using torchrun). I tried to run with
I see that I forgot to upload the traceback when cancelling the multi-GPU command, so I am putting it here
As mentioned before, it seems it simply hangs on the
Then it should be a GPU-specific issue. If
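One way to narrow this down could be to check whether plain torch.distributed/NCCL collectives run on the same GPUs without torch_xla: if that also hangs, the problem is below the XLA layer (driver/NCCL/interconnect); if it passes, the hang is more likely in the XLA/PJRT path. A minimal sketch (the script name and tensor shapes here are arbitrary, not from this issue):

```python
# nccl_check.py - minimal NCCL sanity check that does not use torch_xla.
# Run with: torchrun --nnodes=1 --nproc-per-node=2 nccl_check.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE/LOCAL_RANK for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    # Every rank contributes a tensor of ones; after all_reduce each rank
    # should print a tensor filled with the world size (here: 2.0).
    t = torch.ones(4, device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: {t.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```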
Yeah, it seems to be a GPU-related problem. Any idea how to approach this?
🐛 Bug
I am trying to run the example code for distributed training, but all my attempts hang or return an error.
To Reproduce
I have tried with test_train_mp_mnist and with the example in the PyTorch docs: https://pytorch.org/xla/master/learn/pjrt.html#tl-dr
Right now I am trying with the latter because it is simpler.
Steps to reproduce the behavior:
For this example I am using a single machine with 2 GPUs.
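The script example-xla.py follows the TL;DR pattern from the linked docs. Its exact contents are not included in this issue, so the following is only a minimal sketch of that pattern:

```python
# example-xla.py (sketch) - minimal DDP-over-XLA training loop following the
# pattern of the linked PJRT TL;DR example; treat this as an approximation.
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the 'xla' backend


def _mp_fn(index):
    device = xm.xla_device()
    dist.init_process_group("xla", init_method="xla://")

    model = nn.Linear(128, 10).to(device)
    ddp_model = DDP(model, gradient_as_bucket_view=True)
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    for _ in range(10):
        data = torch.randn(128, 128, device=device)
        target = torch.randn(128, 10, device=device)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(data), target)
        loss.backward()
        optimizer.step()
        xm.mark_step()  # materialize the lazy XLA graph for this step


if __name__ == "__main__":
    torch_xla.launch(_mp_fn)
```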
When I try to run with 1 GPU to check that the command is correct, it gives me the following error:
PJRT_DEVICE=CUDA GPU_NUM_DEVICES=1 torchrun --nnodes=1 --nproc-per-node=1 example-xla.py --epochs 1
On the other hand, if I try to run with both GPUs for parallelism, it just hangs. When I cancel the execution, the traceback is the one below; it appears to hang on
torch_xla.launch
torchrun --nnodes=1 --nproc-per-node=2 example-xla.py --epochs 1
Environment
CUDA 12.5
Driver 555.42.06
Additional context
I can successfully execute other non-parallel XLA scripts.
When I try to use multiple nodes with torchrun, it also hangs, while the same command with non-XLA scripts works perfectly.
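For reference, the multi-node attempts use the usual torchrun rendezvous form, roughly like this (host and port are placeholders, not values from this issue; the command is run once per node with its own --node-rank):
PJRT_DEVICE=CUDA torchrun --nnodes=2 --nproc-per-node=2 --node-rank=0 --rdzv-backend=c10d --rdzv-endpoint=<master-host>:29500 example-xla.py --epochs 1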