Multi-GPU training issues #7
Hi. You could check the Accelerate docs on Hugging Face; they show the launch commands for both single-GPU and multi-GPU training.
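For reference, the usual Accelerate invocations look roughly like this; `train.py` is a placeholder for this project's training script:

```bash
# Single GPU: one process, no distributed setup needed.
accelerate launch --num_processes 1 train.py

# Multi-GPU on a single machine (here: 4 GPUs).
accelerate launch --multi_gpu --num_processes 4 train.py
```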
Thanks, I will try it. Could you help me with this issue?
Hi, I used the following command to train on 4 GPUs.
At first it worked fine, but every time it reached the last batch of an epoch it raised the following error. Changing the size of the dataset always gave the same result. Have you trained on multiple GPUs before?
Hi. I'm really sorry for replying so late. The error message provides a hint: "The variable in question was changed in there or anywhere later. Good luck!" This suggests that the problematic variable is being modified somewhere after its creation, either within the operation that failed to compute its gradient or somewhere later in your code. Here are some steps you could take to troubleshoot this issue:
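A standard first step is PyTorch's autograd anomaly detection, which reports the forward operation that produced the tensor later modified in-place. A minimal sketch with a placeholder model (your generator/discriminator would go here):

```python
import torch

# Placeholder model; substitute your own generator/discriminator.
model = torch.nn.Sequential(
    torch.nn.Linear(8, 8),
    torch.nn.ReLU(),          # avoid ReLU(inplace=True) while debugging
    torch.nn.Linear(8, 1),
)

# Anomaly detection makes autograd report which forward op produced
# the tensor that was later modified in-place.
torch.autograd.set_detect_anomaly(True)

x = torch.randn(4, 8)
out = model(x)

# In-place arithmetic such as `out += 1` mutates a tensor autograd may
# still need; the out-of-place form allocates a new tensor instead.
out = out + 1
loss = out.sum()
loss.backward()  # with anomaly detection on, a failure here names the bad op
```

Other common culprits are `inplace=True` activations and slice assignments like `x[mask] = 0`; cloning the tensor first (`x = x.clone()`) usually resolves those.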
Remember, when modifying your code, the goal is to ensure that any variable that's part of the computational graph is not modified in-place after its creation. If this is not possible due to the requirements of your model, you might need to rethink your model's architecture to avoid the need for in-place operations.
You could also add me as a contact (I guess you use WeChat) if you want to keep in touch with me.
I also had this error and passed track_running_stats=False to the norm layer in the discriminator, and it seems to run fine. However, I think this will adversely impact model performance, so ideally some other fix is found. I also noticed there is no conversion to SyncBatchNorm. Is that intentional? I may be wrong, but I believe this will yield wrong batch statistics (or at least not compute them across all GPU processes).
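For what it's worth, PyTorch ships a one-line conversion; a minimal sketch, assuming the discriminator is an ordinary `nn.Module` (the layer sizes below are placeholders):

```python
import torch

# Placeholder discriminator with a BatchNorm layer.
discriminator = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),
    torch.nn.BatchNorm2d(64),
    torch.nn.LeakyReLU(0.2),
)

# Replaces every BatchNorm*d with SyncBatchNorm so batch statistics
# are aggregated across all GPU processes (takes effect once a
# distributed process group is initialized, e.g. by accelerate).
discriminator = torch.nn.SyncBatchNorm.convert_sync_batchnorm(discriminator)
```

By contrast, `track_running_stats=False` makes the layer always normalize with the current batch and drop its running buffers, which presumably avoids the crash because the in-place running-stat updates go away, but it changes eval-time behavior.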
Hello, thank you very much for your work. Could you provide code for multi-GPU or multi-node training?
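Not speaking for the authors, but the usual Accelerate training loop is short; a minimal sketch with placeholder model and data, launched with the `accelerate launch` commands above:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the GPU/node layout from `accelerate launch`

# Placeholder model, optimizer, and data.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
loader = torch.utils.data.DataLoader(dataset, batch_size=8)

# prepare() wraps the model for DDP, moves everything to the right
# device, and shards the dataloader across processes.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # use this instead of loss.backward()
    optimizer.step()
```

The same script scales to multiple nodes by changing only the launch command (`--num_machines`, `--machine_rank`, `--main_process_ip`).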