Timeout after saving step 5000 checkpoints
Closed this issue · 4 comments
Dear authors,
I ran deepspeed training on a server with 8 A100/80GB GPUs, but the training timed out after saving the step-5000 checkpoint:
[torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step5000 is ready now!
[rank6]:[E ProcessGroupNCCL.cpp:523] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12029440, OpType=_ALLGATHER_BASE, NumelIn=512, NumelOut=4096, Timeout(ms)=1800000) ran for 1808499 milliseconds before timing out.
[rank6]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
I have tried some solutions from huggingface/accelerate#314, but it still fails. I also set ddp_timeout to 5400 (1.5 h), and it still raises a timeout error. Should I keep increasing ddp_timeout, or do you have other solutions?
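For reference, the `Timeout(ms)=1800000` in the watchdog log is the default 30-minute NCCL collective timeout, and `ddp_timeout` is given in seconds, so the values can be compared directly. A minimal sketch of that arithmetic (the plumbing of `ddp_timeout` into `torch.distributed.init_process_group(timeout=...)` is described here only in comments, as an assumption about the Trainer's behavior):

```python
from datetime import timedelta

# The watchdog log reports Timeout(ms)=1800000, i.e. the default
# 30-minute NCCL collective timeout.
default_timeout = timedelta(milliseconds=1_800_000)

# ddp_timeout is specified in seconds; 10800 s is 3 hours. The Trainer
# is assumed to forward this as init_process_group(timeout=...).
ddp_timeout = timedelta(seconds=10_800)

print(default_timeout, ddp_timeout)  # 0:30:00 3:00:00
```

If the hang is a genuine deadlock rather than a slow checkpoint save, no finite timeout will fix it; the timeout only controls how long the watchdog waits before tearing the job down.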
Thank you!
I have increased ddp_timeout to 10800 (3 h), but it still errors with a timeout, so I think the training is genuinely stuck here.
I think this problem is specific to that GPU machine; we suggest switching to another server to run our provided scripts.
I think the author may have run 5000 steps at a time and then resumed. Otherwise the code will not work; please refer to issue #6 (closed), as some code modifications are necessary.
This error cannot be reproduced on our device.