Multi-GPU training issues #7

Open

JunZhan2000 opened this issue Oct 19, 2023 · 6 comments

@JunZhan2000

Hello, thank you very much for your work. Could you provide code for multi-GPU or multi-node training?

@Qiyuan-Ge
Owner

Hi. You could check the Accelerate docs on Hugging Face.

For single GPU:
python train.py

For multi-GPU:
accelerate launch --multi_gpu train.py
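
For reference, here is a minimal, self-contained sketch of the structure a train.py needs so the same script works under both commands. The toy model and random data below are placeholders for illustration, not the code in this repo:

```python
# train.py — minimal Accelerate training loop (toy model/data, for illustration only)
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the launch configuration (single GPU or --multi_gpu)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# prepare() moves everything to the right device, wraps the model in DDP,
# and shards the dataloader across processes when launched with --multi_gpu.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for x, y in loader:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # use this instead of loss.backward()
    optimizer.step()
```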

@JunZhan2000
Author

> Hi. You could check the Accelerate docs on Hugging Face.
>
> For single GPU: python train.py
>
> For multi-GPU: accelerate launch --multi_gpu train.py

Thanks, I will try it. Could you help me with this issue? #8

@JunZhan2000
Author

JunZhan2000 commented Oct 19, 2023

> Hi. You could check the Accelerate docs on Hugging Face.
>
> For single GPU: python train.py
>
> For multi-GPU: accelerate launch --multi_gpu train.py

Hi, I used the following command to train on 4 GPUs:

python -m torch.distributed.launch --nproc_per_node 4 --use_env train_vit_vqgan.py

At first it worked fine, but every time I reached the last batch of an epoch I got the following error. Changing the size of the dataset always gave the same result. Have you trained on multiple GPUs before?

/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/autograd/__init__.py:200: UserWarning: Error detected in NativeBatchNormBackward0. Traceback of forward call that caused the error:
  File "/cpfs01/projects-SSD/cfff-173661e84712_SSD/public/zhanjun/visiontokenizer/PaintMind/train_vit_vqgan.py", line 36, in <module>
    trainer.train()
  File "/cpfs01/projects-SSD/cfff-173661e84712_SSD/public/zhanjun/visiontokenizer/PaintMind/paintmind/utils/trainer.py", line 192, in train
    real_pred = self.discr(img)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/accelerate/utils/operations.py", line 636, in forward
    return model_forward(*args, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/accelerate/utils/operations.py", line 624, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/cpfs01/projects-SSD/cfff-173661e84712_SSD/public/zhanjun/visiontokenizer/PaintMind/paintmind/stage1/discriminator.py", line 60, in forward
    return self.model(input)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward
    return F.batch_norm(
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/functional.py", line 2450, in batch_norm
    return torch.batch_norm(
 (Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "/cpfs01/projects-SSD/cfff-173661e84712_SSD/public/zhanjun/visiontokenizer/PaintMind/train_vit_vqgan.py", line 36, in <module>
    trainer.train()
  File "/cpfs01/projects-SSD/cfff-173661e84712_SSD/public/zhanjun/visiontokenizer/PaintMind/paintmind/utils/trainer.py", line 197, in train
    self.accelerator.backward(d_loss)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/accelerate/accelerator.py", line 1985, in backward
    loss.backward(**kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512]] is at version 8; expected version 7 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

@Qiyuan-Ge
Owner

> Hi, I used the following command to train on 4 GPUs:
>
> python -m torch.distributed.launch --nproc_per_node 4 --use_env train_vit_vqgan.py
>
> At first it worked fine, but every time I reached the last batch of an epoch I got the following error. Changing the size of the dataset always gave the same result. Have you trained on multiple GPUs before?

Hi. I'm really sorry for replying so late.
First, yes, I have trained on multiple GPUs before.
For the second question, here is the reply from GPT-4:
The error message you're seeing is related to PyTorch's autograd engine, which is responsible for performing the backward pass and computing gradients. The error is specifically indicating that one of the variables required for gradient computation has been modified by an inplace operation after its creation, which is not allowed in PyTorch because it interferes with the tracking of operations for gradient computation.

The error message provides a hint: "The variable in question was changed in there or anywhere later. Good luck!" This suggests that the problematic variable is being modified somewhere after its creation, either within the operation that failed to compute its gradient or somewhere later in your code.

Here are some steps you could take to troubleshoot this issue:

  1. Search for inplace operations: In your code, search for any inplace operations that might be modifying variables after their creation. Inplace operations in PyTorch are usually denoted by an underscore at the end of the method name, like add_(), zero_(), copy_(), etc.

  2. Disable inplace operations for debugging: As a debugging step, you could try temporarily disabling inplace operations in your code to see if the error goes away. If it does, this confirms that an inplace operation is the problem, and you can then focus on figuring out which one it is and how to avoid it.

  3. Ensure that all operations are part of the computational graph: If you're using operations that are not part of PyTorch's computational graph, like operations from NumPy or Python's standard library, ensure that these are not modifying any PyTorch tensors in place.

  4. Use torchviz to visualize the computation graph: The torchviz library provides a way to visualize the computation graph, which can be helpful for understanding the flow of data and operations in your model. This might help you identify where the problematic inplace operation is occurring.

  5. Upgrade your PyTorch version: Sometimes, this kind of problem can be caused by bugs in the PyTorch autograd engine itself. If you're not using the latest version of PyTorch, consider upgrading to see if the problem goes away. Make sure to check the PyTorch release notes to see if any relevant bugs were fixed in more recent versions.

Remember, when modifying your code, the goal is to ensure that any variable that's part of the computational graph is not modified in place after its creation. If this is not possible due to the requirements of your model, you might need to rethink your model's architecture to avoid the need for in-place operations.
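
To make the first two steps concrete, here is a tiny, self-contained reproduction of this class of error (unrelated to PaintMind's code) showing how torch.autograd.set_detect_anomaly helps locate the offending in-place op:

```python
import torch

# Anomaly mode records the forward-pass stack of every op, so the failing
# backward also reports where the tensor that was later modified came from
# (this is the "Traceback of forward call" warning seen in the log above).
torch.autograd.set_detect_anomaly(True)

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

y = (w * x).sum()  # x is saved by autograd here: it is needed for dy/dw
x.add_(1.0)        # in-place op (trailing underscore) bumps x's version counter

# Raises: RuntimeError: one of the variables needed for gradient computation
# has been modified by an inplace operation ...
y.backward()
```

Replacing x.add_(1.0) with the out-of-place x = x + 1.0 leaves the saved tensor untouched and the backward succeeds, which is the general shape of the fix.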

@Qiyuan-Ge
Owner

You could also add my contact (I guess you use WeChat) if you want to keep in touch with me.

@Johnathan-Xie

I also had this error; I passed track_running_stats=False to the norm layers in the discriminator and it seems to run fine. However, I do think this will adversely impact model performance, so ideally some other fix is found. Also, I noticed there is no conversion to SyncBatchNorm. Is that intentional? I may be wrong, but I believe this will yield the wrong batch statistics (or at least not compute them across all GPU processes).
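
For anyone trying the same thing, here is a small sketch of both points above, applied to a stand-in discriminator (the real PaintMind module names and architecture may differ):

```python
import torch
from torch import nn

def make_discr(track_stats: bool = True) -> nn.Sequential:
    # Stand-in PatchGAN-style discriminator; the real PaintMind module may differ.
    return nn.Sequential(
        nn.Conv2d(3, 64, 4, stride=2, padding=1),
        nn.BatchNorm2d(64, track_running_stats=track_stats),
        nn.LeakyReLU(0.2),
        nn.Conv2d(64, 1, 4, stride=2, padding=1),
    )

# Workaround described above: with track_running_stats=False the BatchNorm layers
# never update their running buffers in-place, so backward no longer sees a
# version-counter mismatch (at the cost of always using per-batch statistics).
discr = make_discr(track_stats=False)

# Separate point about DDP statistics: plain BatchNorm only normalizes over the
# local per-GPU mini-batch; converting to SyncBatchNorm computes the statistics
# across all processes. The conversion is done before wrapping/preparing the model.
discr_synced = nn.SyncBatchNorm.convert_sync_batchnorm(make_discr())
```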
