ReturnnTrainingJob: torch multi-gpu training: port option missing #459

Closed · kuacakuaca opened this issue Nov 15, 2023 · 15 comments · Fixed by #462

@kuacakuaca (Contributor)

Currently it's not possible to run multiple multi-GPU training jobs on the same host because of a port conflict. It requires adding additional torchrun parameters.
https://github.com/rwth-i6/i6_core/blob/e5c9e241ef67e69098a18434bb1349f2db890939/returnn/training.py#L253C28-L253C28
https://pytorch.org/docs/stable/elastic/run.html#stacked-single-node-multi-worker
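For context, the conflict comes from torchrun's default rendezvous: without any extra options, every torchrun instance on a host tries to bind the default port 29500, so two independent training jobs on the same machine collide. A rough sketch of the launch prefix involved (illustrative only; the helper name is made up and this is not the actual i6_core code):

    # Illustrative sketch, not the real training.py code.
    def build_torchrun_prefix(num_gpus: int) -> list[str]:
        # Without --standalone or --rdzv-endpoint, torchrun falls back to its
        # static rendezvous on port 29500, which collides across jobs.
        return ["torchrun", f"--nproc-per-node={num_gpus}", "--nnodes=1"]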

@albertz (Member) commented Nov 26, 2023

@Judyxujj how did you do that?

albertz changed the title from "torch multi-gpu training: port option missing" to "ReturnnTrainingJob: torch multi-gpu training: port option missing" on Nov 26, 2023

@albertz (Member) commented Nov 26, 2023

Ah, I think I misunderstood. Your issue is because multiple (independent) distributed jobs were scheduled on the same node, which caused the port conflict?

@albertz (Member) commented Nov 26, 2023

What I read is that using --standalone should also automatically do that? (In case of single node only, though.)

@albertz (Member) commented Nov 26, 2023

Now I got:

  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 198, in Engine.init_train_from_config
    line: self._ddp_pt_model = self._torch_distributed_class(
              self._pt_model, device_ids=get_device_ids(), **self._torch_distributed_options
          )
  File "/rwthfs/rz/cluster/work/az668407/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 795, in DistributedDataParallel.__init__
    line: _verify_param_shape_across_processes(self.process_group, parameters)
  File "/rwthfs/rz/cluster/work/az668407/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes 
    line: return dist._verify_params_across_processes(process_group, tensors, logger)
...
DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1 
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:
socketStartConnect: Connect to fe80::ba59:9f03:fc:765c%ib0<57829> failed : Cannot assign requested address

Again not sure if this is related.

Maybe related to that:

@kuacakuaca (Contributor, Author)

Currently it's not possible to run multiple multi-GPU training jobs on the same host because of a port conflict.

Can you be more specific? What error do you get?

It requires adding additional torchrun parameters.

Can you be more specific? What additional parameters? In the doc you linked, it just mentions --standalone for single-node multi-worker, nothing else?

the error message was like:

[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 858, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 692, in _initialize_workers
    self._rendezvous(worker_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).

@kuacakuaca (Contributor, Author)

What I read is that using --standalone should also automatically do that? (In case of single node only, though.)

yes, it works. Setting --rdzv-backend=c10d and --rdzv-endpoint=localhost:$port_number also works.
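For illustration, adding these options in the job could look roughly like this; the helper name and parameters below are hypothetical, not the actual i6_core API:

    # Hypothetical sketch of the torchrun arguments for a single node.
    def torchrun_args(num_gpus: int, standalone: bool = True) -> list[str]:
        args = ["torchrun", f"--nproc-per-node={num_gpus}", "--nnodes=1"]
        if standalone:
            # --standalone sets up a local rendezvous, so no fixed port is shared
            args.append("--standalone")
        else:
            # equivalent on a single node: port 0 lets torchrun pick a free port
            args += ["--rdzv-backend=c10d", "--rdzv-endpoint=localhost:0"]
        return args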

@kuacakuaca (Contributor, Author)

Now I got:

  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 198, in Engine.init_train_from_config
    line: self._ddp_pt_model = self._torch_distributed_class(
              self._pt_model, device_ids=get_device_ids(), **self._torch_distributed_options
          )
  File "/rwthfs/rz/cluster/work/az668407/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 795, in DistributedDataParallel.__init__
    line: _verify_param_shape_across_processes(self.process_group, parameters)
  File "/rwthfs/rz/cluster/work/az668407/py-envs/py3.10-torch2.1/lib/python3.10/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes 
    line: return dist._verify_params_across_processes(process_group, tensors, logger)
...
DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1 
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:
socketStartConnect: Connect to fe80::ba59:9f03:fc:765c%ib0<57829> failed : Cannot assign requested address

Again not sure if this is related.

I haven't had this error. Did you specify something special in self._torch_distributed_options?

@albertz (Member) commented Nov 27, 2023

Setting --rdzv-backend=c10d and --rdzv-endpoint=localhost:$port_number also works.

What would $port_number be? You cannot hardcode anything here, otherwise it would get conflicts again? Or would you just set 0 and it would automatically choose any free port?

@albertz (Member) commented Nov 27, 2023

I haven't had this error. Did you specify something special in self._torch_distributed_options?

Simply torch_distributed = {}. (Btw, I now extended RETURNN so that the class defaults to DistributedDataParallel and the options default to {}. I wonder why we did not have those defaults before. They are reasonable, right?)

I am wondering because of the "Cannot assign requested address" in this error, which sounds related, but I am not sure. I will rerun with NCCL_DEBUG=INFO and see if I get more info.
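For reference, the relevant part of the RETURNN config in this run is just that default-style setting (minimal excerpt, everything else omitted):

    # RETURNN config excerpt: enable torch.distributed training with the
    # default class (DistributedDataParallel) and no extra options.
    torch_distributed = {}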

@kuacakuaca (Contributor, Author) commented Nov 27, 2023

Setting --rdzv-backend=c10d and --rdzv-endpoint=localhost:$port_number also works.

What would $port_number be? You cannot hardcode anything here, otherwise it would get conflicts again? Or would you just set 0 and it would automatically choose any free port?

yes, the documentation https://pytorch.org/docs/stable/elastic/run.html#stacked-single-node-multi-worker says:

To run multiple instances (separate jobs) of single-node, multi-worker on the same host, we need to make sure that each instance (job) is setup on different ports to avoid port conflicts (or worse, two jobs being merged as a single job). To do this you have to run with --rdzv-backend=c10d and specify a different port by setting --rdzv-endpoint=localhost:$PORT_k. For --nodes=1, its often convenient to let torchrun pick a free random port automatically instead of manually assigning different ports for each run.

torchrun
    --rdzv-backend=c10d
    --rdzv-endpoint=localhost:0
    --nnodes=1
    --nproc-per-node=$NUM_TRAINERS
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

@kuacakuaca (Contributor, Author)

Setting --standalone should be easier, but I have no idea how it would affect the distributed training performance.

@albertz (Member) commented Nov 27, 2023

Setting --standalone should be easier, but I have no idea how it would affect the distributed training performance.

From the doc, I thought it would just do exactly the same?
Edit: Yes, see here, it just sets those things.

@albertz (Member) commented Nov 27, 2023

So it sounds like --standalone is what we should use with --nnodes=1, right?

However, I wonder what to do when using multiple nodes.
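One possible direction, sketched only under the assumption that the torchrun arguments are built in one place (function and parameter names below are made up, not necessarily what #462 ends up doing):

    # Hypothetical sketch: choose rendezvous flags based on the node count.
    def rendezvous_args(num_nodes: int, master_host: str, port: int) -> list[str]:
        if num_nodes == 1:
            # single node: no shared port needed, torchrun handles it locally
            return ["--nnodes=1", "--standalone"]
        # multiple nodes: every node must point at the same host:port, so the
        # port must be chosen/communicated such that it does not clash with
        # other jobs running on the master host.
        return [
            f"--nnodes={num_nodes}",
            "--rdzv-backend=c10d",
            f"--rdzv-endpoint={master_host}:{port}",
        ]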
