torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 14447) of binary:
zhongruizhe123 opened this issue · 2 comments
zhongruizhe123 commented
I encountered the following error while training on a single GPU:
torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 1 (pid: 14447) of binary:
I tried adjusting the launch parameter --nproc_per_node=1, but only the local_rank in the message changed:
torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 14447) of binary:
zhongruizhe123 commented
I found the problem: the machine ran out of memory. Exit code -9 means the worker was killed with SIGKILL, which on Linux typically comes from the OOM killer.
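For anyone hitting the same exitcode -9: one quick way to confirm an out-of-memory kill is to watch the process's peak resident memory while training. A minimal sketch using only the Python standard library (assuming Linux, where `ru_maxrss` is reported in kilobytes; the allocation size below is just for illustration):

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Return this process's peak resident set size in megabytes."""
    # On Linux ru_maxrss is in kilobytes; on macOS it is in bytes.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        rss //= 1024
    return rss / 1024

# Allocate ~100 MB to show the counter moving; in a real training
# script you would call peak_rss_mb() periodically between batches.
buf = bytearray(100 * 1024 * 1024)
print(f"peak RSS ~ {peak_rss_mb():.0f} MB")
```

If the reported peak climbs toward the machine's total RAM shortly before the worker dies, the OOM killer is the likely culprit; reducing batch size or dataloader workers usually helps.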