torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 14447) of binary: #83

zhongruizhe123 · 2024-06-15T13:45:17Z

I encountered the following error while training on a single GPU:
torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 1 (pid: 14447) of binary:

I tried setting the launch parameter --nproc_per_node=1, but the only thing that changed was local_rank:
torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 14447) of binary:

zhongruizhe123 · 2024-06-19T15:08:12Z

I found the problem: there was not enough memory.
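
For anyone hitting the same message: a negative exit code reported for a child process generally means it was terminated by a signal, and signal 9 is SIGKILL, which is what the Linux out-of-memory killer sends. That is consistent with the memory diagnosis above. A minimal sketch for decoding such an exit code, assuming a POSIX system; `describe_exitcode` is a hypothetical helper name, not part of PyTorch:

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Turn a worker exit code into a readable description.

    Hypothetical helper for illustration. Assumes the POSIX convention
    that a negative exit code -N means "terminated by signal N".
    """
    if exitcode < 0:
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"unknown signal {-exitcode}"
        return f"killed by {name}"
    return f"exited with status {exitcode}"


# The exit code from the error message in this issue.
print(describe_exitcode(-9))
```

If the result is SIGKILL, the kernel log (for example via `dmesg`) will often show whether the out-of-memory killer was responsible. Typical mitigations are a smaller batch size, fewer data-loader workers, or more RAM or swap, though which one helps depends on the training setup.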

