User ability to decide particular GPU for a model #84
-
Hello, thank you very much for publicly sharing this library! Could you please let me know if it is possible for the user to decide which GPU a model should utilize? I am currently integrating FTorch with an MPI-based solver. I usually have access to a node that has 40 MPI ranks and 2 GPUs. I believe FTorch in its current form only leverages the default GPU to run the model, but I may be wrong. I was wondering if changes could be made to get_device (https://github.com/Cambridge-ICCS/FTorch/blob/main/src/ctorch.cpp#L32), based on https://discuss.pytorch.org/t/how-to-specify-cuda-device-number-in-c/54220/3, to let the user decide which GPU to leverage. Thanks!
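For context, the change suggested in the linked PyTorch forum post amounts to constructing the device with an explicit index rather than relying on the default. The sketch below is plain C++ (so it compiles without libtorch, with the real call shown only in a comment), and the function name `select_cuda_device` is a hypothetical stand-in, not part of FTorch's API:

```cpp
#include <stdexcept>

// Hypothetical sketch: given the number of visible GPUs and a user-requested
// index, return the index that a get_device-style function could use. In
// actual libtorch code this index would be passed when constructing the
// device, roughly: torch::Device(torch::kCUDA, device_index).
int select_cuda_device(int num_visible_gpus, int requested_index) {
  if (num_visible_gpus <= 0) {
    throw std::runtime_error("no CUDA devices visible");
  }
  if (requested_index < 0 || requested_index >= num_visible_gpus) {
    throw std::out_of_range("requested GPU index out of range");
  }
  return requested_index;
}
```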
Replies: 3 comments 2 replies
-
Hi @siddanib, This is something we are still actively working on. Part of this will be set/handled by your job scheduling software (probably Slurm) and the system/architecture: it depends slightly on how things are set up as to which GPU each CPU on the node will offload to by default/specification. The first step would be to establish how the GPUs appear on your system (e.g. by accessing a compute node in an interactive session). If you can do this we can make some progress. Some information about this (assuming a fully saturated node) is here, but we are still working on this. I will also try to look at this once we are back in the new year.
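As a concrete illustration of the default-mapping question: on a node with more MPI ranks than GPUs, a common convention (this is a generic sketch, not FTorch's implementation) is to assign ranks to GPUs round-robin by the node-local rank:

```cpp
// Generic sketch (not FTorch code): round-robin assignment of MPI ranks on a
// node to the GPUs visible on that node. With 40 ranks and 2 GPUs, even local
// ranks land on GPU 0 and odd local ranks on GPU 1. In a real MPI program the
// local rank would come from the launcher (e.g. Slurm's SLURM_LOCALID) or
// from MPI_Comm_split_type with MPI_COMM_TYPE_SHARED.
int gpu_for_rank(int local_rank, int gpus_per_node) {
  return local_rank % gpus_per_node;
}
```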
-
Hi @jatkinson1000, Thank you very much for your detailed response and resource. I did end up experimenting with this idea of the user explicitly setting the GPU device number. I have made some preliminary changes that are currently here (https://github.com/siddanib/FTorch). Please note that the changes are very crude, but some preliminary tests on my end showed positive results. My idea was to change get_device so that the user can pass in the desired device number. Could you please let me know a list of things that you would like me to do before opening a pull request? Furthermore, are you considering utilizing CUDA MPS (https://docs.nvidia.com/deploy/mps/index.html) to handle the case of N MPI ranks trying to utilize a single GPU? Happy New Year!
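The fork's changes aren't quoted here, but the general shape of such a modification is to thread a device index through to the device constructor. The names below (`FakeDevice`, `make_device`) are illustrative stand-ins, not the actual FTorch API:

```cpp
// Illustrative sketch only: a device-construction helper extended with an
// explicit device_index argument. In the real library this would be roughly
//   torch::Device(use_cuda ? torch::kCUDA : torch::kCPU, device_index)
// with the index originating from the Fortran caller.
struct FakeDevice {
  bool is_cuda;
  int index;
};

FakeDevice make_device(bool use_cuda, int device_index) {
  if (!use_cuda) {
    return FakeDevice{false, -1};  // CPU: the index is ignored
  }
  return FakeDevice{true, device_index};
}
```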
-
This request was completed in #96 by @jwallwork23 and is now in production! |