Note
Multi-node scheduling needs to be enabled on the cluster, and you should be using a version of the RunAI CLI that supports multi-node jobs.
Caution
This doc explains an advanced usage of RunAI.
Jobs can be submitted through the RunAI CLI as documented on RunAI's website (https://docs.run.ai/v2.13/Researcher/cli-reference/runai-submit-dist-pytorch/).
As an example, the following command launches 3 pods, each with 4 GPUs. Note that the number of pods is one more than the number of workers, as the master pod is not counted as a worker.
runai submit-dist pytorch \
--name distributed-job-readme \
--workers=2 -g 4 -i ic-registry.epfl.ch/mlo/lauzhack:v1 \
--annotation k8s.v1.cni.cncf.io/networks=kube-system/roce \
--extended-resource rdma/rdma=1 \
-- "sleep infinity"
Note that it is not possible to control how these pods are scheduled, so they may end up on the same node or on different nodes. For best performance, the number of local GPUs should be maximized, which means asking for pods with 8 GPUs each (taking a full node), as in the example below.
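For instance, a full-node variant of the command above might look as follows (a sketch only: it reuses the image and flags from the example above and assumes nodes with 8 GPUs, so adjust the worker count and image to your needs):
runai submit-dist pytorch \
--name distributed-job-full-node \
--workers=1 -g 8 -i ic-registry.epfl.ch/mlo/lauzhack:v1 \
--annotation k8s.v1.cni.cncf.io/networks=kube-system/roce \
--extended-resource rdma/rdma=1 \
-- "sleep infinity"
This launches 2 pods (1 worker plus the master), each occupying a full 8-GPU node.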
RunAI handles scheduling the pods and also creates the necessary communication (rendezvous) backend (most likely c10d) between them. The following environment variables are set:
WORLD_SIZE
: Number of pods (the number of GPUs in each pod does not matter).
RANK
: Rank of the pod (the number of GPUs in each pod does not matter).
MASTER_ADDR
: IP address of the master node.
MASTER_PORT
: Port on which the master node is listening.
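As a quick sanity check, you can print these variables inside each pod (the exact values are assigned by RunAI at submission time):
env | grep -E 'WORLD_SIZE|RANK|MASTER_ADDR|MASTER_PORT'
For the 3-pod example above, you would expect WORLD_SIZE=3 and a different RANK between 0 and 2 in each pod.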
To run a training job, torchrun accepts the above variables as arguments and takes care of launching and coordinating the training processes. For example, the following command can be used to launch a training job on the 3 pods we started above. Note that the command needs to be run on each of the pods separately.
torchrun \
--nproc-per-node 4 \
--nnodes ${WORLD_SIZE} \
--node_rank ${RANK} \
--master_addr ${MASTER_ADDR} \
--master_port ${MASTER_PORT} \
main.py
torchrun automatically launches a separate process for each GPU and assigns the correct global rank. As such, for basic usage (e.g. FSDP), no changes to the Python code are necessary.
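To see this in action, the following sketch (with a hypothetical file name check_ranks.py) makes every spawned process report its ranks; with the 3 pods of 4 GPUs each launched above, you should see 12 distinct global ranks (0 to 11):
cat > check_ranks.py <<'EOF'
import os
import torch
import torch.distributed as dist

# torchrun sets LOCAL_RANK for each process it spawns on this pod.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Rendezvous information (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) is read
# from the environment that torchrun prepares for each process.
dist.init_process_group(backend="nccl")
print(f"global rank {dist.get_rank()} / {dist.get_world_size()}, local rank {local_rank}")
dist.destroy_process_group()
EOF
torchrun \
--nproc-per-node 4 \
--nnodes ${WORLD_SIZE} \
--node_rank ${RANK} \
--master_addr ${MASTER_ADDR} \
--master_port ${MASTER_PORT} \
check_ranks.py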
While the above should get a job running, additional setup is necessary for efficient communication, in particular for using RDMA. We already specified the following flags when launching our pods to ensure RDMA support:
--annotation k8s.v1.cni.cncf.io/networks=kube-system/roce --extended-resource rdma/rdma=1
However, the communication backend requires additional configuration to use RDMA. In particular, the following steps are needed when using NCCL. The exact steps may vary across OS distributions and versions, as well as when alternative InfiniBand/RDMA drivers are installed.
1. Determine the device name: usually there is a single directory in /sys/class/infiniband, and the name of that directory is the name of the registered RDMA device. In my setup, the InfiniBand device was registered as mlx5_bond_0.
2. Determine the correct GID index for RDMA:
2.1. Both IPv4 and IPv6 are available in the pod, so there are two GID indices for each connection type, one for IPv4 and one for IPv6. To find the one that uses IPv4, we can use the following command:
grep '0000:0000:0000:0000:0000:ffff' /sys/class/infiniband/mlx5_bond_0/ports/1/gids/*
2.2. It is also possible to use either RoCE v1 or RoCE v2. For example, to find the indices corresponding to RoCE v2, we can use:
grep 'RoCE v2' /sys/class/infiniband/mlx5_bond_0/ports/1/gid_attrs/types/* 2>/dev/null
2.3. The GID index that appears in the output of both commands is the one we want. For the pods I was running, this was always index 9. The following one-liner performs all of the above operations:
grep 'RoCE v2' $(grep '0000:0000:0000:0000:0000:ffff' /sys/class/infiniband/mlx5_bond_0/ports/1/gids/* | cut -d ':' -f 1 | sed 's/gids/gid_attrs\/types/') | sed -e 's/.*\/\([0-9]*\):.*/\1/'
3. Once we know the device name as well as the correct GID index, we can configure NCCL by setting the environment variable NCCL_IB_GID_INDEX to the desired GID index. Furthermore, we should set NCCL_IB_HCA to a prefix of the device name to ensure NCCL uses the right device; for example, either mlx5_bond_0 or a prefix such as mlx5 should work.
export NCCL_IB_GID_INDEX=$(grep 'RoCE v2' $(grep '0000:0000:0000:0000:0000:ffff' /sys/class/infiniband/mlx5_bond_0/ports/1/gids/* | cut -d ':' -f 1 | sed 's/gids/gid_attrs\/types/') | sed -e 's/.*\/\([0-9]*\):.*/\1/')
export NCCL_IB_HCA=mlx5
export NCCL_SOCKET_NTHREADS=4
export NCCL_NSOCKS_PERTHREAD=8
4. Run torchrun with the above environment variables set. This should usually be enough for NCCL to use RDMA correctly. To verify, you can use tools such as ifstat, which monitor the network traffic that goes through the CPU; when RDMA is in use, no such traffic should be visible (assuming you are not using the network interface for anything else). See the sketch below.
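For example, one way to check (a sketch: ifstat may need to be installed in the image, and interface names differ between setups) is to watch the per-interface traffic counters in one of the pods while the training job is running:
# With RDMA in use, the regular Ethernet interface should show only negligible
# traffic even though NCCL is exchanging large tensors; without RDMA, the full
# NCCL traffic would be visible here.
ifstat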