AkariAsai/self-rag

torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 14447) of binary:

zhongruizhe123 opened this issue · 2 comments

I encountered the following error while training on a single GPU:
torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 1 (pid: 14447) of binary:

I tried adjusting the launch parameter --nproc_per_node=1, but only the local_rank in the error message changed:
torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 14447) of binary:
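For context, the single-GPU launch looked roughly like this (the script name and training flags below are illustrative placeholders, not the exact self-rag command):

```bash
# Hypothetical single-GPU launch; finetune.py and its flags are placeholders,
# not the exact self-rag training entry point.
torchrun --nproc_per_node=1 --master_port=29500 finetune.py \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8
```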

I found the problem: the process was killed because there was not enough memory (exitcode -9 means the worker was killed, typically by the OOM killer).

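For anyone hitting the same error: exitcode -9 means the worker received SIGKILL, which on Linux usually comes from the kernel OOM killer when host RAM runs out (this is separate from a CUDA out-of-memory error). A quick way to confirm it, assuming a Linux machine where you can read the kernel log:

```bash
# Look for an OOM-kill record and check current memory headroom.
dmesg -T | grep -i -E 'out of memory|killed process' | tail -n 5
free -h
nvidia-smi   # GPU memory usage; exitcode -9 is usually host RAM, not GPU memory
```

Common mitigations are to reduce the batch size, increase gradient accumulation, or move optimizer/parameter state off the host (e.g., DeepSpeed ZeRO offload), depending on which memory is actually exhausted.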