xuyuzhuang11/OneBit

Timeout after saving step 5000 checkpoints

Closed this issue · 4 comments

Dear authors,

I ran DeepSpeed training on a server with 8× A100 80GB GPUs, but training timed out after saving the step-5000 checkpoint:
[torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step5000 is ready now!
[rank6]:[E ProcessGroupNCCL.cpp:523] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12029440, OpType=_ALLGATHER_BASE, NumelIn=512, NumelOut=4096, Timeout(ms)=1800000) ran for 1808499 milliseconds before timing out.

[rank6]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

I have tried some of the solutions from huggingface/accelerate#314, but it still fails. I also set ddp_timeout to 5400 (1.5 h), and it still raises a timeout error. Should I keep increasing ddp_timeout, or do you have other solutions?
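For reference, this is roughly how I am raising the timeout, assuming the training goes through the Hugging Face Trainer (ddp_timeout is a standard TrainingArguments field that is forwarded to torch.distributed.init_process_group; the output_dir below is a placeholder):

```python
from transformers import TrainingArguments

# ddp_timeout overrides the default 1800 s (30 min) NCCL collective timeout;
# Trainer forwards it to torch.distributed.init_process_group.
args = TrainingArguments(
    output_dir="./checkpoints",  # placeholder path
    ddp_timeout=5400,            # 1.5 h, the value I tried above
    # ... other training arguments unchanged ...
)
```

Equivalently, `--ddp_timeout 5400` on the command line, if the script parses TrainingArguments with HfArgumentParser.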

Thank you!

I have increased ddp_timeout to 10800 (3 h), but it still errors with a timeout, so I think training is genuinely stuck here rather than just slow.

I think this problem is specific to your GPU machine; we suggest switching to another server and running our provided scripts there.

I think the authors may have run 5000 steps at a time and then resumed from the saved checkpoint. Otherwise the code will not work; please refer to issue #6 (closed), where some necessary code modifications are described.
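For anyone trying that workaround, here is a minimal sketch of the restart loop, assuming a Hugging Face Trainer-style entry point; the script name train.py, the flag names, and the total step count are illustrative, not taken from this repo:

```python
import subprocess

CHUNK = 5000    # steps per run, matching the checkpoint interval above
TOTAL = 20000   # illustrative total step count

for end in range(CHUNK, TOTAL + 1, CHUNK):
    cmd = [
        "deepspeed", "train.py",   # placeholder entry point
        "--max_steps", str(end),   # stop exactly at the next 5000-step boundary
        "--save_steps", str(CHUNK),
        "--output_dir", "./checkpoints",
    ]
    if end > CHUNK:
        # after the first chunk, resume from the previous checkpoint
        # (assumes Trainer's default checkpoint-<step> directory naming)
        cmd += ["--resume_from_checkpoint", f"./checkpoints/checkpoint-{end - CHUNK}"]
    subprocess.run(cmd, check=True)
```

Stopping exactly at the save boundary means each process exits right after writing the checkpoint, instead of entering whatever collective hangs after the save and tripping the NCCL watchdog.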

We cannot reproduce this error on our devices.
