Multi-GPU training issues #7

Open

JunZhan2000 opened this issue Oct 19, 2023 · 6 comments

@JunZhan2000

Hello, thank you very much for your work. Could you provide code for multi-GPU or multi-node training?

@Qiyuan-Ge
Owner

Hi. You could check the Accelerate docs on Hugging Face.

For single GPU:
python train.py

For multi-GPU:
accelerate launch --multi_gpu train.py
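
For reference, here is a minimal, self-contained sketch of the structure a train.py needs so the same script works under both commands. The toy model and random data below are placeholders for illustration, not the code in this repo:

```python
# train.py — minimal Accelerate training loop (toy model/data, for illustration only)
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the launch configuration (single GPU or --multi_gpu)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# prepare() moves everything to the right device, wraps the model in DDP,
# and shards the dataloader across processes when launched with --multi_gpu.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for x, y in loader:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # use this instead of loss.backward()
    optimizer.step()
```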

@JunZhan2000
Author

> Hi. You could check the Accelerate docs on Hugging Face.
>
> For single GPU: python train.py
>
> For multi-GPU: accelerate launch --multi_gpu train.py

Thanks, I will try it. Could you help me with this issue? #8

@JunZhan2000
Author

JunZhan2000 commented Oct 19, 2023

> Hi. You could check the Accelerate docs on Hugging Face.
>
> For single GPU: python train.py
>
> For multi-GPU: accelerate launch --multi_gpu train.py

Hi, I used the following command to train on 4 GPUs:

python -m torch.distributed.launch --nproc_per_node 4 --use_env train_vit_vqgan.py

At first it worked fine, but every time I reached the last batch of an epoch I got the following error. Changing the size of the dataset always gave the same result. Have you trained on multiple GPUs before?

/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/autograd/__init__.py:200: UserWarning: Error detected in NativeBatchNormBackward0. Traceback of forward call that caused the error:
  File "/cpfs01/projects-SSD/cfff-173661e84712_SSD/public/zhanjun/visiontokenizer/PaintMind/train_vit_vqgan.py", line 36, in <module>
    trainer.train()
  File "/cpfs01/projects-SSD/cfff-173661e84712_SSD/public/zhanjun/visiontokenizer/PaintMind/paintmind/utils/trainer.py", line 192, in train
    real_pred = self.discr(img)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/accelerate/utils/operations.py", line 636, in forward
    return model_forward(*args, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/accelerate/utils/operations.py", line 624, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/cpfs01/projects-SSD/cfff-173661e84712_SSD/public/zhanjun/visiontokenizer/PaintMind/paintmind/stage1/discriminator.py", line 60, in forward
    return self.model(input)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward
    return F.batch_norm(
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/functional.py", line 2450, in batch_norm
    return torch.batch_norm(
 (Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "/cpfs01/projects-SSD/cfff-173661e84712_SSD/public/zhanjun/visiontokenizer/PaintMind/train_vit_vqgan.py", line 36, in <module>
    trainer.train()
  File "/cpfs01/projects-SSD/cfff-173661e84712_SSD/public/zhanjun/visiontokenizer/PaintMind/paintmind/utils/trainer.py", line 197, in train
    self.accelerator.backward(d_loss)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/accelerate/accelerator.py", line 1985, in backward
    loss.backward(**kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512]] is at version 8; expected version 7 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

@Qiyuan-Ge
Owner

> Hi, I used the following command to train on 4 GPUs:
>
> python -m torch.distributed.launch --nproc_per_node 4 --use_env train_vit_vqgan.py
>
> At first it worked fine, but every time I reached the last batch of an epoch I got the following error. Changing the size of the dataset always gave the same result. Have you trained on multiple GPUs before?

Hi. I'm really sorry for replying so late.
First, yes, I have trained on multiple GPUs before.
For the second question, here is the reply from GPT-4:
The error message you're seeing is related to PyTorch's autograd engine, which is responsible for performing the backward pass and computing gradients. The error is specifically indicating that one of the variables required for gradient computation has been modified by an inplace operation after its creation, which is not allowed in PyTorch because it interferes with the tracking of operations for gradient computation.

The error message provides a hint: "The variable in question was changed in there or anywhere later. Good luck!" This suggests that the problematic variable is being modified somewhere after its creation, either within the operation that failed to compute its gradient or somewhere later in your code.

Here are some steps you could take to troubleshoot this issue:

  1. Search for inplace operations: In your code, search for any inplace operations that might be modifying variables after their creation. Inplace operations in PyTorch are usually denoted by an underscore at the end of the method name, like add_(), zero_(), copy_(), etc.

  2. Disable inplace operations for debugging: As a debugging step, you could try temporarily disabling inplace operations in your code to see if the error goes away. If it does, this confirms that an inplace operation is the problem, and you can then focus on figuring out which one it is and how to avoid it.

  3. Ensure that all operations are part of the computational graph: If you're using operations that are not part of PyTorch's computational graph, like operations from NumPy or Python's standard library, ensure that these are not modifying any PyTorch tensors in place.

  4. Use torchviz to visualize the computation graph: The torchviz library provides a way to visualize the computation graph, which can be helpful for understanding the flow of data and operations in your model. This might help you identify where the problematic inplace operation is occurring.

  5. Upgrade your PyTorch version: Sometimes, this kind of problem can be caused by bugs in the PyTorch autograd engine itself. If you're not using the latest version of PyTorch, consider upgrading to see if the problem goes away. Make sure to check the PyTorch release notes to see if any relevant bugs were fixed in more recent versions.

Remember, when modifying your code, the goal is to ensure that any variable that's part of the computational graph is not modified in place after its creation. If this is not possible due to the requirements of your model, you might need to rethink your model's architecture to avoid the need for in-place operations.
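
To make the first two steps concrete, here is a tiny, self-contained reproduction of this class of error (unrelated to PaintMind's code) showing how torch.autograd.set_detect_anomaly helps locate the offending in-place op:

```python
import torch

# Anomaly mode records the forward-pass stack of every op, so the failing
# backward also reports where the tensor that was later modified came from
# (this is the "Traceback of forward call" warning seen in the log above).
torch.autograd.set_detect_anomaly(True)

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

y = (w * x).sum()  # x is saved by autograd here: it is needed for dy/dw
x.add_(1.0)        # in-place op (trailing underscore) bumps x's version counter

# Raises: RuntimeError: one of the variables needed for gradient computation
# has been modified by an inplace operation ...
y.backward()
```

Replacing x.add_(1.0) with the out-of-place x = x + 1.0 leaves the saved tensor untouched and the backward succeeds, which is the general shape of the fix.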

@Qiyuan-Ge
Owner

You could also add my contact (I guess you use WeChat) if you want to keep in touch with me.

@Johnathan-Xie

I also had this error; I passed track_running_stats=False to the norm layers in the discriminator and it seems to run fine. However, I do think this will adversely impact model performance, so ideally some other fix is found. Also, I noticed there is no conversion to SyncBatchNorm. Is that intentional? I may be wrong, but I believe this will yield the wrong batch statistics (or at least not compute them across all GPU processes).
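
For anyone trying the same thing, here is a small sketch of both points above, applied to a stand-in discriminator (the real PaintMind module names and architecture may differ):

```python
import torch
from torch import nn

def make_discr(track_stats: bool = True) -> nn.Sequential:
    # Stand-in PatchGAN-style discriminator; the real PaintMind module may differ.
    return nn.Sequential(
        nn.Conv2d(3, 64, 4, stride=2, padding=1),
        nn.BatchNorm2d(64, track_running_stats=track_stats),
        nn.LeakyReLU(0.2),
        nn.Conv2d(64, 1, 4, stride=2, padding=1),
    )

# Workaround described above: with track_running_stats=False the BatchNorm layers
# never update their running buffers in-place, so backward no longer sees a
# version-counter mismatch (at the cost of always using per-batch statistics).
discr = make_discr(track_stats=False)

# Separate point about DDP statistics: plain BatchNorm only normalizes over the
# local per-GPU mini-batch; converting to SyncBatchNorm computes the statistics
# across all processes. The conversion is done before wrapping/preparing the model.
discr_synced = nn.SyncBatchNorm.convert_sync_batchnorm(make_discr())
```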
