ReturnnTrainingJob: torch multi-gpu training: port option missing #459
Comments
@Judyxujj how did you do that?
Ah, I think I misunderstood. Your issue is because multiple (independent) distributed jobs were scheduled on the same node, which caused the port conflict?
What I read is that using […]
Now I got: […]
Again, not sure if this is related. Maybe related to that: […]

the error message was like: […]
yes, it works. Setting […]
I haven't had this error. Did you specify something special in […]?
What would […]?
I simply wonder because of the […]
yes, the documentation https://pytorch.org/docs/stable/elastic/run.html#stacked-single-node-multi-worker says: […]
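Paraphrasing that section: when several independent single-node jobs can land on the same host, each `torchrun` call needs its own rendezvous port, and port 0 lets `torchrun` pick a free one. A minimal sketch of the extra arguments (Python list form, as they might be appended to the launch command; the variable name and GPU count are placeholders, and older `torchrun` versions spell these flags with underscores, e.g. `--rdzv_backend`):

```python
# Extra torchrun arguments for independent single-node jobs sharing a host:
# a c10d rendezvous on "localhost:0" lets torchrun choose a free port,
# so concurrent jobs no longer collide on the default master port.
stacked_single_node_args = [
    "--rdzv-backend=c10d",
    "--rdzv-endpoint=localhost:0",
    "--nnodes=1",
    "--nproc-per-node=4",  # placeholder: number of GPUs used by this job
]
```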
setting […]
So it sounds like […]. However, I wonder what to do when using multiple nodes?
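For completeness, a hedged sketch of the multi-node case, following the fault-tolerant launch example in the same torchrun documentation; the job id, host name, and counts below are placeholder assumptions:

```python
# Hypothetical multi-node variant: all nodes must use the same rendezvous id
# and the same endpoint (host:port of one reachable node), so a fixed port is
# needed here instead of the "localhost:0" trick used for single-node jobs.
job_id = "returnn_job_42"           # placeholder: unique id shared by all nodes
head_node = "node01.cluster:29400"  # placeholder: host:port of one participating node
multi_node_args = [
    "--nnodes=2",                   # placeholder: number of nodes
    "--nproc-per-node=4",           # placeholder: GPUs per node
    f"--rdzv-id={job_id}",
    "--rdzv-backend=c10d",
    f"--rdzv-endpoint={head_node}",
]
```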
Currently it is not possible to run multiple multi-GPU training jobs on the same host because of a port conflict. Fixing this requires adding additional torchrun parameters.
https://github.com/rwth-i6/i6_core/blob/e5c9e241ef67e69098a18434bb1349f2db890939/returnn/training.py#L253C28-L253C28
https://pytorch.org/docs/stable/elastic/run.html#stacked-single-node-multi-worker
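To illustrate what the "additional torchrun parameters" could look like at the linked spot in training.py, here is a minimal sketch; the function name, RETURNN entry point, and config path are hypothetical and not the actual i6_core code:

```python
import subprocess


def launch_returnn_torch_distributed(num_gpus: int, returnn_config: str) -> None:
    """Hypothetical launcher: torchrun with a free local rendezvous port."""
    cmd = [
        "torchrun",
        "--rdzv-backend=c10d",
        "--rdzv-endpoint=localhost:0",  # free random port -> no per-host conflicts
        "--nnodes=1",
        f"--nproc-per-node={num_gpus}",
        "rnn.py",        # assumed RETURNN entry point
        returnn_config,  # assumed path to the RETURNN config file
    ]
    subprocess.check_call(cmd)
```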