Abnormal training time when running multiple NeuRad jobs #32
Comments
Have you checked that the jobs do not use the same resources (GPU, CPU)?
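For reference, here is a minimal sketch of how a job could be pinned to its own GPU and a bounded number of CPU threads before training starts. It assumes a PyTorch-based trainer (NeuRad builds on nerfstudio/PyTorch); the GPU index and thread count are arbitrary example values, not NeuRad settings.

```python
# Minimal sketch (assuming a PyTorch-based trainer): pin this job to its own
# GPU and bound its CPU threads so two concurrent jobs do not contend for the
# same resources.
import os

# Hypothetical choice: give this job physical GPU 1 only. Must be set before
# torch initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# Bound intra-op CPU threads so two jobs do not oversubscribe the same cores.
torch.set_num_threads(8)

if torch.cuda.is_available():
    # With CUDA_VISIBLE_DEVICES="1", cuda:0 inside this process is physical GPU 1.
    print("visible GPUs:", torch.cuda.device_count())
    print("training on:", torch.cuda.get_device_name(0))
```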
I see. We often train multiple jobs in parallel on our cluster as well and have never had any issues where they affect each other. I know that the multiprocess data loading has given some people issues; I am not sure if that is the case here as well?
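As an illustration of the kind of contention meant here, the generic PyTorch sketch below shows why multiprocess data loading can slow things down when several jobs share one node; the dataset and `num_workers` value are made up for the example and are not NeuRad's configuration.

```python
# Generic PyTorch illustration (not NeuRad's actual dataloader code): each job
# spawns num_workers extra worker processes, and two jobs with many workers
# can oversubscribe the CPUs of a shared node and slow both jobs down.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 3), torch.randint(0, 2, (1024,)))

# Setting num_workers=0 loads data in the main process; trying this is one way
# to rule out worker contention as the cause of the slowdown.
loader = DataLoader(dataset, batch_size=32, num_workers=0, shuffle=True)

for features, labels in loader:
    pass  # the training step would go here
```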
Hi, the issue was solved by setting …
The training time becomes longer when I run a second job on a multi-GPU cluster.

In addition, the second job's training time is also slower, as shown below.

Could you give me some suggestions?
Thank you in advance.