From c632b0226208f7947f11bff67ae7ffbd2dd97a2b Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Mon, 17 Feb 2025 13:48:00 +0100
Subject: [PATCH 01/12] Add documentation about configuring SLURM on GH200

---
 docs/tools/slurm.md | 88 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 77 insertions(+), 11 deletions(-)

diff --git a/docs/tools/slurm.md b/docs/tools/slurm.md
index 757cfe8..ceef319 100644
--- a/docs/tools/slurm.md
+++ b/docs/tools/slurm.md
@@ -20,37 +20,103 @@ The following sections will provide detailed guidance on how to use SLURM to req
 [](){#gh200-slurm}
 ### NVIDIA GH200 GPU Nodes
 
-!!! todo
-    document how slurm can be used on the Grace-Hopper nodes.
+The [GH200 nodes on Alps][gh200-node] have four GPUs per node, and SLURM job submissions must be configured appropriately to best make use of the resources. Applications that can saturate the GPUs with a single process per GPU should generally prefer this mode. [Configuring SLURM jobs to use a single GPU per rank][gh200-slurm-single-rank-per-gpu] is also the most straightforward setup. Some applications perform badly with a single rank per GPU, and require use of [NVIDIA's Multi-Process-Service (MPS)](https://docs.nvidia.com/deploy/mps/index.html) to oversubscribe GPUs with multiple ranks per GPU.
 
-    Note how you can link to this section from elsewhere using the anchor above, e.g.:
+The best SLURM configuration is application- and workload-specific, so it is worth testing which works best in your particular case. Also see [TODO][TODO] for information about recommended application-specific SLURM configurations.
 
-    ```
-    [using slurm on Grace-Hopper][gh200-slurm]
-    ```
+!!! warning
+    The GH200 nodes have their GPUs configured in ["default" compute mode](https://docs.nvidia.com/deploy/mps/index.html#gpu-compute-modes). Unlike "exclusive process" mode, "default" mode allows multiple processes to submit work to a single GPU simultaneously. This also means that different ranks on the same node can inadvertently use the same GPU leading to suboptimal performance or unused GPUs, rather than job failures.
+
+    Some applications benefit from using multiple ranks per GPU. However, [MPS should be used][gh200-slurm-multi-rank-per-gpu] in these cases.
+
+    If you are unsure about which GPU is being used for a particular rank, print the `CUDA_VISIBLE_DEVICES` variable, along with e.g. `SLURM_LOCALID`, `SLURM_PROCID`, and `SLURM_NODEID` variables, in your job script. If the variable is unset or empty all GPUs are visible to the rank and the rank will in most cases only use the first GPU.
 
-Link to the [Grace-Hopper overview][gh200-node].
+[](){#gh200-slurm-single-rank-per-gpu}
+#### One rank per GPU
 
-An example of using tabs to show srun and sbatch useage to get one GPU per MPI rank:
+Configuring SLURM to use one GH200 GPU per rank is easiest done using the `--ntasks-per-node=4` and `--gpus-per-task=1` SLURM flags. For advanced users, using `--gpus-per-task` is equivalent to setting `CUDA_VISIBLE_DEVICES` to `SLURM_LOCALID`, assuming the job is using four ranks per node. The examples below launch jobs on two nodes with four ranks per node using `sbatch` and `srun`:
 
 === "sbatch"
 
     ```bash
     #!/bin/bash
     #SBATCH --job-name=affinity-test
-    #SBATCH --ntasks-per-node=4
     #SBATCH --nodes=2
+    #SBATCH --ntasks-per-node=4
     #SBATCH --gpus-per-task=1
 
-    srun affinity
+    srun <application>
     ```
 
 === "srun"
 
     ```
-    > srun -n8 -N2 --gpus-per-task=1 affinity
+    srun --nodes=2 --ntasks-per-node=4 --gpus-per-task=1 <application>
     ```
+
+Omitting the `--gpus-per-task` flag will lead to all ranks on the node using the first GPU.
+
+[](){#gh200-slurm-multi-rank-per-gpu}
+#### Multiple ranks per GPU
+
+Using multiple ranks per GPU can improve performance e.g. of applications that don't generate enough work for a GPU using a single rank, or ones that scale badly to all 72 cores of the Grace CPU. In these cases SLURM jobs must be configured to assign multiple ranks to a single GPU. This is best done using MPS. To use MPS, launch your application using the following wrapper script, which will start MPS on one rank per node and assign GPUs to ranks according to the CPU mask of a rank, ensuring the closest GPU is used:
+
+```bash
+#!/bin/bash
+# Example mps-wrapper.sh usage:
+# > srun --cpu-bind=socket [srun args] mps-wrapper.sh [cmd] [cmd args]
+
+# Only this path is supported by MPS
+export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
+export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log-$(id -un)
+
+# Launch MPS from a single rank per node
+if [[ $SLURM_LOCALID -eq 0 ]]; then
+    CUDA_VISIBLE_DEVICES=0,1,2,3 nvidia-cuda-mps-control -d
+fi
+
+# Set CUDA device
+numa_nodes=$(hwloc-calc --physical --intersect NUMAnode $(hwloc-bind --get --taskset))
+export CUDA_VISIBLE_DEVICES=$numa_nodes
+
+# Wait for MPS to start
+sleep 1
+
+# Run the command
+numactl --membind=$numa_nodes "$@"
+result=$?
+
+# Quit MPS control daemon before exiting
+if [[ $SLURM_LOCALID -eq 0 ]]; then
+    echo quit | nvidia-cuda-mps-control
+fi
+
+exit $result
+```
+
+Save the above script as `mps-wrapper.sh` and make it executable with `chmod +x mps-wrapper.sh`. If the `mps-wrapper.sh` script is in the current working directory, you can then launch jobs using MPS for example as follows:
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=oversubscription-affinity-test
+#SBATCH --nodes=2
+#SBATCH --ntasks-per-node=32
+#SBATCH --cpus-per-task=8
+
+export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
+
+srun --cpu-bind=socket ./mps-wrapper.sh <application>
+```
+
+Note that in the example job above:
+
+- `--gpus-per-node` is not set at all; the `mps-wrapper.sh` script ensures that the right GPU is visible for each rank using `CUDA_VISIBLE_DEVICES`
+- `--ntasks-per-node` is set to 32; this results in 8 ranks per GPU
+- `--cpus-per-task` is set to 8; this ensures that the CPU mask is set appropriately for each rank
+- `OMP_NUM_THREADS` is exported for applications that use OpenMP; this may not be needed for your application, or you may need other libraries to be configured to use the correct number of threads
+- `--cpu-bind=socket` is set on the `srun` command; this will expose a full CPU for each rank, allowing threads to migrate between cores within the socket, but not across sockets
+The configuration that is optimal for your application may be different.
 
 [](){#amdcpu-slurm}
 ## AMD CPU

From 33968fafe55f77cf70923d6371934a5798010393 Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Mon, 17 Feb 2025 15:03:51 +0100
Subject: [PATCH 02/12] Split up paragraphs to one sentence per line in slurm/GH200 section

---
 docs/tools/slurm.md | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/docs/tools/slurm.md b/docs/tools/slurm.md
index ceef319..df7d007 100644
--- a/docs/tools/slurm.md
+++ b/docs/tools/slurm.md
@@ -20,21 +20,30 @@ The following sections will provide detailed guidance on how to use SLURM to req
 [](){#gh200-slurm}
 ### NVIDIA GH200 GPU Nodes
 
-The [GH200 nodes on Alps][gh200-node] have four GPUs per node, and SLURM job submissions must be configured appropriately to best make use of the resources. Applications that can saturate the GPUs with a single process per GPU should generally prefer this mode. [Configuring SLURM jobs to use a single GPU per rank][gh200-slurm-single-rank-per-gpu] is also the most straightforward setup. Some applications perform badly with a single rank per GPU, and require use of [NVIDIA's Multi-Process-Service (MPS)](https://docs.nvidia.com/deploy/mps/index.html) to oversubscribe GPUs with multiple ranks per GPU.
+The [GH200 nodes on Alps][gh200-node] have four GPUs per node, and SLURM job submissions must be configured appropriately to best make use of the resources.
+Applications that can saturate the GPUs with a single process per GPU should generally prefer this mode.
+[Configuring SLURM jobs to use a single GPU per rank][gh200-slurm-single-rank-per-gpu] is also the most straightforward setup.
+Some applications perform badly with a single rank per GPU, and require use of [NVIDIA's Multi-Process-Service (MPS)](https://docs.nvidia.com/deploy/mps/index.html) to oversubscribe GPUs with multiple ranks per GPU.
 
-The best SLURM configuration is application- and workload-specific, so it is worth testing which works best in your particular case. Also see [TODO][TODO] for information about recommended application-specific SLURM configurations.
+The best SLURM configuration is application- and workload-specific, so it is worth testing which works best in your particular case.
+Also see [TODO][TODO] for information about recommended application-specific SLURM configurations.
 
 !!! warning
-    The GH200 nodes have their GPUs configured in ["default" compute mode](https://docs.nvidia.com/deploy/mps/index.html#gpu-compute-modes). Unlike "exclusive process" mode, "default" mode allows multiple processes to submit work to a single GPU simultaneously. This also means that different ranks on the same node can inadvertently use the same GPU leading to suboptimal performance or unused GPUs, rather than job failures.
+    The GH200 nodes have their GPUs configured in ["default" compute mode](https://docs.nvidia.com/deploy/mps/index.html#gpu-compute-modes).
+    Unlike "exclusive process" mode, "default" mode allows multiple processes to submit work to a single GPU simultaneously.
+    This also means that different ranks on the same node can inadvertently use the same GPU leading to suboptimal performance or unused GPUs, rather than job failures.
 
     Some applications benefit from using multiple ranks per GPU. However, [MPS should be used][gh200-slurm-multi-rank-per-gpu] in these cases.
 
-    If you are unsure about which GPU is being used for a particular rank, print the `CUDA_VISIBLE_DEVICES` variable, along with e.g. `SLURM_LOCALID`, `SLURM_PROCID`, and `SLURM_NODEID` variables, in your job script. If the variable is unset or empty all GPUs are visible to the rank and the rank will in most cases only use the first GPU.
+    If you are unsure about which GPU is being used for a particular rank, print the `CUDA_VISIBLE_DEVICES` variable, along with e.g. `SLURM_LOCALID`, `SLURM_PROCID`, and `SLURM_NODEID` variables, in your job script.
+    If the variable is unset or empty all GPUs are visible to the rank and the rank will in most cases only use the first GPU.
 
 [](){#gh200-slurm-single-rank-per-gpu}
 #### One rank per GPU
 
-Configuring SLURM to use one GH200 GPU per rank is easiest done using the `--ntasks-per-node=4` and `--gpus-per-task=1` SLURM flags. For advanced users, using `--gpus-per-task` is equivalent to setting `CUDA_VISIBLE_DEVICES` to `SLURM_LOCALID`, assuming the job is using four ranks per node. The examples below launch jobs on two nodes with four ranks per node using `sbatch` and `srun`:
+Configuring SLURM to use one GH200 GPU per rank is easiest done using the `--ntasks-per-node=4` and `--gpus-per-task=1` SLURM flags.
+For advanced users, using `--gpus-per-task` is equivalent to setting `CUDA_VISIBLE_DEVICES` to `SLURM_LOCALID`, assuming the job is using four ranks per node.
+The examples below launch jobs on two nodes with four ranks per node using `sbatch` and `srun`:
 
 === "sbatch"
 
@@ -59,7 +68,10 @@ Omitting the `--gpus-per-task` flag will lead to all ranks on the node using the
 [](){#gh200-slurm-multi-rank-per-gpu}
 #### Multiple ranks per GPU
 
-Using multiple ranks per GPU can improve performance e.g. of applications that don't generate enough work for a GPU using a single rank, or ones that scale badly to all 72 cores of the Grace CPU. In these cases SLURM jobs must be configured to assign multiple ranks to a single GPU. This is best done using MPS. To use MPS, launch your application using the following wrapper script, which will start MPS on one rank per node and assign GPUs to ranks according to the CPU mask of a rank, ensuring the closest GPU is used:
+Using multiple ranks per GPU can improve performance e.g. of applications that don't generate enough work for a GPU using a single rank, or ones that scale badly to all 72 cores of the Grace CPU.
+In these cases SLURM jobs must be configured to assign multiple ranks to a single GPU.
+This is best done using MPS.
+To use MPS, launch your application using the following wrapper script, which will start MPS on one rank per node and assign GPUs to ranks according to the CPU mask of a rank, ensuring the closest GPU is used:
 
 ```bash
 #!/bin/bash
@@ -94,7 +106,8 @@ fi
 exit $result
 ```
 
-Save the above script as `mps-wrapper.sh` and make it executable with `chmod +x mps-wrapper.sh`. If the `mps-wrapper.sh` script is in the current working directory, you can then launch jobs using MPS for example as follows:
+Save the above script as `mps-wrapper.sh` and make it executable with `chmod +x mps-wrapper.sh`.
+If the `mps-wrapper.sh` script is in the current working directory, you can then launch jobs using MPS for example as follows:
 
 ```bash
 #!/bin/bash

From 4192fe4c05bd828e3a9cb70b6d7533e6318d69fd Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Mon, 17 Feb 2025 17:21:05 +0100
Subject: [PATCH 03/12] Update link to scientific applications in GH200/SLURM section

---
 docs/tools/slurm.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/tools/slurm.md b/docs/tools/slurm.md
index df7d007..0bafbb5 100644
--- a/docs/tools/slurm.md
+++ b/docs/tools/slurm.md
@@ -26,7 +26,7 @@ Applications that can saturate the GPUs with a single process per GPU should gen
 Some applications perform badly with a single rank per GPU, and require use of [NVIDIA's Multi-Process-Service (MPS)](https://docs.nvidia.com/deploy/mps/index.html) to oversubscribe GPUs with multiple ranks per GPU.
 
 The best SLURM configuration is application- and workload-specific, so it is worth testing which works best in your particular case.
-Also see [TODO][TODO] for information about recommended application-specific SLURM configurations.
+Also see [Scientific Applications][sciapps] for information about recommended application-specific SLURM configurations.

From edb4c10252a3aebf27e0132b2dd20f0021681631 Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Wed, 19 Feb 2025 10:10:53 +0100
Subject: [PATCH 04/12] Remove unnecessary word from GH200 slurm docs

---
 docs/tools/slurm.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/tools/slurm.md b/docs/tools/slurm.md
index 0bafbb5..66cc998 100644
--- a/docs/tools/slurm.md
+++ b/docs/tools/slurm.md
@@ -26,7 +26,7 @@ Applications that can saturate the GPUs with a single process per GPU should gen
 Some applications perform badly with a single rank per GPU, and require use of [NVIDIA's Multi-Process-Service (MPS)](https://docs.nvidia.com/deploy/mps/index.html) to oversubscribe GPUs with multiple ranks per GPU.
 
 The best SLURM configuration is application- and workload-specific, so it is worth testing which works best in your particular case.
-Also see [Scientific Applications][sciapps] for information about recommended application-specific SLURM configurations.
+See [Scientific Applications][sciapps] for information about recommended application-specific SLURM configurations.

From 497f83e4bd4eb97849b7420290ee014339bba477 Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Wed, 19 Feb 2025 10:11:10 +0100
Subject: [PATCH 05/12] Fix indentation in GH200 slurm docs

---
 docs/tools/slurm.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/tools/slurm.md b/docs/tools/slurm.md
index 66cc998..9a1ac89 100644
--- a/docs/tools/slurm.md
+++ b/docs/tools/slurm.md
@@ -33,10 +33,10 @@ See [Scientific Applications][sciapps] for information about recommended applica
     Unlike "exclusive process" mode, "default" mode allows multiple processes to submit work to a single GPU simultaneously.
     This also means that different ranks on the same node can inadvertently use the same GPU leading to suboptimal performance or unused GPUs, rather than job failures.
 
-    Some applications benefit from using multiple ranks per GPU. However, [MPS should be used][gh200-slurm-multi-rank-per-gpu] in these cases.
+    Some applications benefit from using multiple ranks per GPU. However, [MPS should be used][gh200-slurm-multi-rank-per-gpu] in these cases.
 
-    If you are unsure about which GPU is being used for a particular rank, print the `CUDA_VISIBLE_DEVICES` variable, along with e.g. `SLURM_LOCALID`, `SLURM_PROCID`, and `SLURM_NODEID` variables, in your job script.
-    If the variable is unset or empty all GPUs are visible to the rank and the rank will in most cases only use the first GPU.
+    If you are unsure about which GPU is being used for a particular rank, print the `CUDA_VISIBLE_DEVICES` variable, along with e.g. `SLURM_LOCALID`, `SLURM_PROCID`, and `SLURM_NODEID` variables, in your job script.
+    If the variable is unset or empty all GPUs are visible to the rank and the rank will in most cases only use the first GPU.
 
 [](){#gh200-slurm-single-rank-per-gpu}
 #### One rank per GPU

From 092fb690f73a3660a3d3fb1e327122d979605628 Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Wed, 19 Feb 2025 10:15:49 +0100
Subject: [PATCH 06/12] Remove superfluous srun example from GH200 slurm docs

---
 docs/tools/slurm.md | 24 ++++++++----------------
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/docs/tools/slurm.md b/docs/tools/slurm.md
index 9a1ac89..32c5f85 100644
--- a/docs/tools/slurm.md
+++ b/docs/tools/slurm.md
@@ -45,23 +45,15 @@ Configuring SLURM to use one GH200 GPU per rank is easiest done using the `--nta
 For advanced users, using `--gpus-per-task` is equivalent to setting `CUDA_VISIBLE_DEVICES` to `SLURM_LOCALID`, assuming the job is using four ranks per node.
 The examples below launch jobs on two nodes with four ranks per node using `sbatch` and `srun`:
 
-=== "sbatch"
-
-    ```bash
-    #!/bin/bash
-    #SBATCH --job-name=affinity-test
-    #SBATCH --nodes=2
-    #SBATCH --ntasks-per-node=4
-    #SBATCH --gpus-per-task=1
-
-    srun <application>
-    ```
-
-=== "srun"
+```bash
+#!/bin/bash
+#SBATCH --job-name=affinity-test
+#SBATCH --nodes=2
+#SBATCH --ntasks-per-node=4
+#SBATCH --gpus-per-task=1
 
-    ```
-    srun --nodes=2 --ntasks-per-node=4 --gpus-per-task=1 <application>
-    ```
+srun <application>
+```
 
 Omitting the `--gpus-per-task` flag will lead to all ranks on the node using the first GPU.
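
As a quick way to check the rank-to-GPU mapping that the warning admonition above suggests inspecting, a minimal job script along the following lines can be used; this is an illustrative sketch rather than part of the patch series, and the job name is made up:

```bash
#!/bin/bash
#SBATCH --job-name=gpu-visibility-check
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1

# Each rank prints its node, rank indices, and the GPUs it can see.
srun bash -c 'echo "node=$SLURM_NODEID rank=$SLURM_PROCID local=$SLURM_LOCALID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
```

With `--gpus-per-task=1` each rank on a node should report a distinct device; without the flag, `CUDA_VISIBLE_DEVICES` is typically unset and most applications fall back to the first GPU.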

From 27926d00c5f0e4391190261ef5d4945441308042 Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Wed, 19 Feb 2025 10:17:35 +0100
Subject: [PATCH 07/12] Update slurm job names in GH200 slurm docs

---
 docs/tools/slurm.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/tools/slurm.md b/docs/tools/slurm.md
index 32c5f85..f1349f3 100644
--- a/docs/tools/slurm.md
+++ b/docs/tools/slurm.md
@@ -47,7 +47,7 @@ The examples below launch jobs on two nodes with four ranks per node using `sbat
 ```bash
 #!/bin/bash
-#SBATCH --job-name=affinity-test
+#SBATCH --job-name=gh200-single-rank-per-gpu
 #SBATCH --nodes=2
 #SBATCH --ntasks-per-node=4
 #SBATCH --gpus-per-task=1
@@ -103,7 +103,7 @@ If the `mps-wrapper.sh` script is in the current working directory, you can then
 ```bash
 #!/bin/bash
-#SBATCH --job-name=oversubscription-affinity-test
+#SBATCH --job-name=gh200-multiple-ranks-per-gpu
 #SBATCH --nodes=2
 #SBATCH --ntasks-per-node=32
 #SBATCH --cpus-per-task=8

From 4f43fdc38054b806cb72fee507ca00ed0259c059 Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Wed, 19 Feb 2025 10:22:35 +0100
Subject: [PATCH 08/12] Add another MPS link

---
 docs/tools/slurm.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/tools/slurm.md b/docs/tools/slurm.md
index f1349f3..2b25afc 100644
--- a/docs/tools/slurm.md
+++ b/docs/tools/slurm.md
@@ -62,7 +62,7 @@ Omitting the `--gpus-per-task` flag will lead to all ranks on the node using the
 Using multiple ranks per GPU can improve performance e.g. of applications that don't generate enough work for a GPU using a single rank, or ones that scale badly to all 72 cores of the Grace CPU.
 In these cases SLURM jobs must be configured to assign multiple ranks to a single GPU.
-This is best done using MPS.
+This is best done using [MPS](https://docs.nvidia.com/deploy/mps/index.html).
 To use MPS, launch your application using the following wrapper script, which will start MPS on one rank per node and assign GPUs to ranks according to the CPU mask of a rank, ensuring the closest GPU is used:

From bf0c26c48610181143a7d6a65e843c40b6923c1d Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Wed, 19 Feb 2025 10:29:37 +0100
Subject: [PATCH 09/12] Update MPS links in GH200 slurm docs

---
 docs/tools/slurm.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/docs/tools/slurm.md b/docs/tools/slurm.md
index 2b25afc..3de6852 100644
--- a/docs/tools/slurm.md
+++ b/docs/tools/slurm.md
@@ -23,7 +23,7 @@ The following sections will provide detailed guidance on how to use SLURM to req
 The [GH200 nodes on Alps][gh200-node] have four GPUs per node, and SLURM job submissions must be configured appropriately to best make use of the resources.
 Applications that can saturate the GPUs with a single process per GPU should generally prefer this mode.
 [Configuring SLURM jobs to use a single GPU per rank][gh200-slurm-single-rank-per-gpu] is also the most straightforward setup.
-Some applications perform badly with a single rank per GPU, and require use of [NVIDIA's Multi-Process-Service (MPS)](https://docs.nvidia.com/deploy/mps/index.html) to oversubscribe GPUs with multiple ranks per GPU.
+Some applications perform badly with a single rank per GPU, and require use of [NVIDIA's Multi-Process Service (MPS)] to oversubscribe GPUs with multiple ranks per GPU.
 
 The best SLURM configuration is application- and workload-specific, so it is worth testing which works best in your particular case.
 See [Scientific Applications][sciapps] for information about recommended application-specific SLURM configurations.
@@ -62,7 +62,7 @@ Omitting the `--gpus-per-task` flag will lead to all ranks on the node using the
 Using multiple ranks per GPU can improve performance e.g. of applications that don't generate enough work for a GPU using a single rank, or ones that scale badly to all 72 cores of the Grace CPU.
 In these cases SLURM jobs must be configured to assign multiple ranks to a single GPU.
-This is best done using [MPS](https://docs.nvidia.com/deploy/mps/index.html).
+This is best done using [NVIDIA's Multi-Process Service (MPS)].
 To use MPS, launch your application using the following wrapper script, which will start MPS on one rank per node and assign GPUs to ranks according to the CPU mask of a rank, ensuring the closest GPU is used:
@@ -123,6 +123,8 @@ Note that in the example job above:
 The configuration that is optimal for your application may be different.
 
+[NVIDIA's Multi-Process Service (MPS)]: https://docs.nvidia.com/deploy/mps/index.html
+
 [](){#amdcpu-slurm}
 ## AMD CPU

From 2add4041c818cbf356114fadf3e8b3d9dbac22fb Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Wed, 19 Feb 2025 11:07:01 +0100
Subject: [PATCH 10/12] Add short motivation for "default" compute mode on GH200

---
 docs/tools/slurm.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/tools/slurm.md b/docs/tools/slurm.md
index 3de6852..18c0d8f 100644
--- a/docs/tools/slurm.md
+++ b/docs/tools/slurm.md
@@ -30,6 +30,7 @@ See [Scientific Applications][sciapps] for information about recommended applica
 !!! warning
     The GH200 nodes have their GPUs configured in ["default" compute mode](https://docs.nvidia.com/deploy/mps/index.html#gpu-compute-modes).
+    The "default" mode is used to avoid issues with certain containers.
     Unlike "exclusive process" mode, "default" mode allows multiple processes to submit work to a single GPU simultaneously.
     This also means that different ranks on the same node can inadvertently use the same GPU leading to suboptimal performance or unused GPUs, rather than job failures.

From 40d40fcd240cdb2966e263bcc44bea5adc0b9d02 Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Wed, 19 Feb 2025 11:10:53 +0100
Subject: [PATCH 11/12] Clarify consequences of omitting --gpus-per-task on GH200

---
 docs/tools/slurm.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/tools/slurm.md b/docs/tools/slurm.md
index 18c0d8f..e996ee3 100644
--- a/docs/tools/slurm.md
+++ b/docs/tools/slurm.md
@@ -56,7 +56,7 @@ The examples below launch jobs on two nodes with four ranks per node using `sbat
 srun <application>
 ```
 
-Omitting the `--gpus-per-task` flag will lead to all ranks on the node using the first GPU.
+Omitting the `--gpus-per-task` results in `CUDA_VISIBLE_DEVICES` being unset, which will lead to most applications using the first GPU on all ranks.
 
 [](){#gh200-slurm-multi-rank-per-gpu}
 #### Multiple ranks per GPU

From b350064cb59c44cf556c5d29b6169b02112dec1e Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Wed, 19 Feb 2025 11:19:43 +0100
Subject: [PATCH 12/12] Remove CPU binding mentions for now from GH200 slurm section

---
 docs/tools/slurm.md | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/docs/tools/slurm.md b/docs/tools/slurm.md
index e996ee3..065401f 100644
--- a/docs/tools/slurm.md
+++ b/docs/tools/slurm.md
@@ -69,7 +69,7 @@ To use MPS, launch your application using the following wrapper script, which wi
 ```bash
 #!/bin/bash
 # Example mps-wrapper.sh usage:
-# > srun --cpu-bind=socket [srun args] mps-wrapper.sh [cmd] [cmd args]
+# > srun [srun args] mps-wrapper.sh [cmd] [cmd args]
 
 # Only this path is supported by MPS
 export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
 export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log-$(id -un)
@@ -109,18 +109,14 @@ If the `mps-wrapper.sh` script is in the current working directory, you can then
 #SBATCH --ntasks-per-node=32
 #SBATCH --cpus-per-task=8
 
-export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
-
-srun --cpu-bind=socket ./mps-wrapper.sh <application>
+srun ./mps-wrapper.sh <application>
 ```
 
 Note that in the example job above:
 
 - `--gpus-per-node` is not set at all; the `mps-wrapper.sh` script ensures that the right GPU is visible for each rank using `CUDA_VISIBLE_DEVICES`
 - `--ntasks-per-node` is set to 32; this results in 8 ranks per GPU
-- `--cpus-per-task` is set to 8; this ensures that the CPU mask is set appropriately for each rank
-- `OMP_NUM_THREADS` is exported for applications that use OpenMP; this may not be needed for your application, or you may need other libraries to be configured to use the correct number of threads
-- `--cpu-bind=socket` is set on the `srun` command; this will expose a full CPU for each rank, allowing threads to migrate between cores within the socket, but not across sockets
+- `--cpus-per-task` is set to 8; this ensures that threads are not allowed to migrate across the whole GH200 node
 The configuration that is optimal for your application may be different.
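
For reference, the note in the one-rank-per-GPU section that `--gpus-per-task` is equivalent to setting `CUDA_VISIBLE_DEVICES` to `SLURM_LOCALID` (with four ranks per node) can be illustrated with a small wrapper; the script and its file name are an illustrative sketch only and not part of the patch series:

```bash
#!/bin/bash
# visible-devices-wrapper.sh -- illustrative sketch, not from the patches above.
# Point each rank at the GPU whose index matches its local rank ID,
# assuming four ranks per GH200 node.
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"
```

Launched as `srun --ntasks-per-node=4 ./visible-devices-wrapper.sh <application>`, this should give each rank the same per-rank GPU visibility as using `--gpus-per-task=1`.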