ReturnnTrainingJob: torch multi-gpu training: port option missing #459

Closed · kuacakuaca opened this issue Nov 15, 2023 · 15 comments · Fixed by #462

@kuacakuaca (Contributor)

Currently it's not possible to run multiple multi-GPU training jobs on the same host because of a port conflict. It requires adding additional torchrun parameters.
https://github.com/rwth-i6/i6_core/blob/e5c9e241ef67e69098a18434bb1349f2db890939/returnn/training.py#L253C28-L253C28
https://pytorch.org/docs/stable/elastic/run.html#stacked-single-node-multi-worker
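For context, the conflict comes from torchrun's default rendezvous: without any extra options, every torchrun instance on a host tries to bind the default port 29500, so two independent training jobs on the same machine collide. A rough sketch of the launch prefix involved (illustrative only; the helper name is made up and this is not the actual i6_core code):

    # Illustrative sketch, not the real training.py code.
    def build_torchrun_prefix(num_gpus: int) -> list[str]:
        # Without --standalone or --rdzv-endpoint, torchrun falls back to its
        # static rendezvous on port 29500, which collides across jobs.
        return ["torchrun", f"--nproc-per-node={num_gpus}", "--nnodes=1"]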

@albertz (Member) commented Nov 26, 2023

@Judyxujj how did you do that?

albertz changed the title from "torch multi-gpu training: port option missing" to "ReturnnTrainingJob: torch multi-gpu training: port option missing" on Nov 26, 2023

@albertz (Member) commented Nov 26, 2023

Ah, I think I misunderstood. Your issue is because multiple (independent) distributed jobs were scheduled on the same node, which caused the port conflict?

@albertz (Member) commented Nov 26, 2023

What I read is that using --standalone should also automatically do that? (In case of single node only, though.)

@albertz (Member) commented Nov 26, 2023

Now I got:

  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 198, in Engine.init_train_from_config
    line: self._ddp_pt_model = self._torch_distributed_class(
              self._pt_model, device_ids=get_device_ids(), **self._torch_distributed_options
          )
  File "/rwthfs/rz/cluster/work/az668407/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 795, in DistributedDataParallel.__init__
    line: _verify_param_shape_across_processes(self.process_group, parameters)
  File "/rwthfs/rz/cluster/work/az668407/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes 
    line: return dist._verify_params_across_processes(process_group, tensors, logger)
...
DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1 
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:
socketStartConnect: Connect to fe80::ba59:9f03:fc:765c%ib0<57829> failed : Cannot assign requested address

Again not sure if this is related.

Maybe related to that:

@kuacakuaca (Contributor, Author)

Currently it's not possible to run multiple multi-GPU training jobs on the same host because of a port conflict.

Can you be more specific? What error do you get?

It requires adding additional torchrun parameters.

Can you be more specific? What additional parameters? In the doc you linked, it just mentions --standalone for single-node multi-worker, nothing else?

the error message was like:

[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 858, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 692, in _initialize_workers
    self._rendezvous(worker_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).

@kuacakuaca (Contributor, Author)

What I read is that using --standalone should also automatically do that? (In case of single node only, though.)

yes, it works. Setting --rdzv-backend=c10d and --rdzv-endpoint=localhost:$port_number also works.
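For illustration, adding these options in the job could look roughly like this; the helper name and parameters below are hypothetical, not the actual i6_core API:

    # Hypothetical sketch of the torchrun arguments for a single node.
    def torchrun_args(num_gpus: int, standalone: bool = True) -> list[str]:
        args = ["torchrun", f"--nproc-per-node={num_gpus}", "--nnodes=1"]
        if standalone:
            # --standalone sets up a local rendezvous, so no fixed port is shared
            args.append("--standalone")
        else:
            # equivalent on a single node: port 0 lets torchrun pick a free port
            args += ["--rdzv-backend=c10d", "--rdzv-endpoint=localhost:0"]
        return args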

@kuacakuaca (Contributor, Author)

Now I got:

  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 198, in Engine.init_train_from_config
    line: self._ddp_pt_model = self._torch_distributed_class(
              self._pt_model, device_ids=get_device_ids(), **self._torch_distributed_options
          )
  File "/rwthfs/rz/cluster/work/az668407/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 795, in DistributedDataParallel.__init__
    line: _verify_param_shape_across_processes(self.process_group, parameters)
  File "/rwthfs/rz/cluster/work/az668407/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes 
    line: return dist._verify_params_across_processes(process_group, tensors, logger)
...
DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1 
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:
socketStartConnect: Connect to fe80::ba59:9f03:fc:765c%ib0<57829> failed : Cannot assign requested address

Again not sure if this is related.

I haven't had this error. Did you specify something special in self._torch_distributed_options?

@albertz (Member) commented Nov 27, 2023

Setting --rdzv-backend=c10d and --rdzv-endpoint=localhost:$port_number also works.

What would $port_number be? You cannot hardcode anything here, otherwise it would get conflicts again? Or would you just set 0 and it would automatically choose any free port?

@albertz (Member) commented Nov 27, 2023

I haven't had this error. Did you specify something special in self._torch_distributed_options?

Simply torch_distributed = {}. (Btw, I now extended RETURNN so that the class defaults to DistributedDataParallel and the options default to {}. I wonder why we did not have those defaults before. They are reasonable, right?)

I am wondering because of the "Cannot assign requested address" in this error, which sounds related, but I am not sure. I will rerun with NCCL_DEBUG=INFO and see if I get more info.
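For reference, the relevant part of the RETURNN config in this run is just that default-style setting (minimal excerpt, everything else omitted):

    # RETURNN config excerpt: enable torch.distributed training with the
    # default class (DistributedDataParallel) and no extra options.
    torch_distributed = {}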

@kuacakuaca (Contributor, Author) commented Nov 27, 2023

Setting --rdzv-backend=c10d and --rdzv-endpoint=localhost:$port_number also works.

What would $port_number be? You cannot hardcode anything here, otherwise it would get conflicts again? Or would you just set 0 and it would automatically choose any free port?

yes, the documentation https://pytorch.org/docs/stable/elastic/run.html#stacked-single-node-multi-worker says:

To run multiple instances (separate jobs) of single-node, multi-worker on the same host, we need to make sure that each instance (job) is setup on different ports to avoid port conflicts (or worse, two jobs being merged as a single job). To do this you have to run with --rdzv-backend=c10d and specify a different port by setting --rdzv-endpoint=localhost:$PORT_k. For --nodes=1, its often convenient to let torchrun pick a free random port automatically instead of manually assigning different ports for each run.

torchrun
    --rdzv-backend=c10d
    --rdzv-endpoint=localhost:0
    --nnodes=1
    --nproc-per-node=$NUM_TRAINERS
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

@kuacakuaca (Contributor, Author)

Setting --standalone should be easier, but I have no idea how it would affect the distributed training performance.

@albertz (Member) commented Nov 27, 2023

Setting --standalone should be easier, but I have no idea how it would affect the distributed training performance.

From the doc, I thought it would just do exactly the same?
Edit: Yes, see here, it just sets those things.

@albertz (Member) commented Nov 27, 2023

So it sounds like --standalone is what we should use with --nnodes=1, right?

However, I wonder what to do when using multiple nodes.
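One possible direction, sketched only under the assumption that the torchrun arguments are built in one place (function and parameter names below are made up, not necessarily what #462 ends up doing):

    # Hypothetical sketch: choose rendezvous flags based on the node count.
    def rendezvous_args(num_nodes: int, master_host: str, port: int) -> list[str]:
        if num_nodes == 1:
            # single node: no shared port needed, torchrun handles it locally
            return ["--nnodes=1", "--standalone"]
        # multiple nodes: every node must point at the same host:port, so the
        # port must be chosen/communicated such that it does not clash with
        # other jobs running on the master host.
        return [
            f"--nnodes={num_nodes}",
            "--rdzv-backend=c10d",
            f"--rdzv-endpoint={master_host}:{port}",
        ]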
