Training process gets stuck during training epoch
merlushka opened this issue · 8 comments
Hi!
Thank you for your work and code!
I am trying to run training in multi-GPU mode (DDP strategy). It starts training, but then suddenly gets stuck in the middle of an epoch without any warnings or errors. It just stops progressing, although the GPU monitor shows that the GPUs are busy.
Depending on the number of GPUs, it can happen at different points (epoch 2, 3 or even 6), but it always gets stuck.
Have you experienced such behaviour?
I had the same problem as you, did you fix it?
I attempted to run the code on another machine with A100 GPUs, and at first it seemed to work without this hang.
As for my initial attempts (Tesla P40 GPUs), I found out that the problem was a desynchronization in syncBatchNorm.
I set
os.environ['TORCH_DISTRIBUTED_DEBUG'] = 'DETAIL'
and got
RuntimeError: Detected mismatch between collectives on ranks.
It looks like some GPUs are still running the forward pass while others have already started the backward pass. I haven't found the reason yet.
I turned off syncBatchNorm (set self.sync_bn = False in the Former3D initialization in former_v1.py) and training proceeded without hangs. But I suspect synchronized batch norm is important, and this fix might affect the training results.
If you find a proper solution, please let me know!
UPD. The training process on the other machine (with A100 GPUs) failed with the same error as well, but only once fine-tuning started.
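For reference, here is a minimal sketch of the two changes described above. It is a hypothetical illustration, not the repo's actual code: only the Former3D / self.sync_bn names and the environment variable come from this thread; the helper function and the toy model are stand-ins.

import os
import torch.nn as nn

# Surface collective mismatches as a RuntimeError instead of a silent hang.
# Must be set before torch.distributed initializes the process group.
os.environ['TORCH_DISTRIBUTED_DEBUG'] = 'DETAIL'

def maybe_convert_sync_bn(model: nn.Module, sync_bn: bool) -> nn.Module:
    # With sync_bn=False (the workaround above) the model keeps plain BatchNorm,
    # which issues no collectives and therefore cannot deadlock across ranks.
    if sync_bn:
        model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    return model

# Toy model standing in for Former3D, just to show the call:
toy = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
toy = maybe_convert_sync_bn(toy, sync_bn=False)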
I also had the same problem. Have you fixed it now?
No, I have not fixed it yet
Can you reproduce the reported results? Apart from setting sync_batch=False, I used the default code and followed the official scripts. However, I got the following results:
AbsRel, 0.062
AbsDiff, 0.100
SqRel, 0.042
RMSE, 0.216
LogRMSE, 0.115
r1, 0.944
r2, 0.974
r3, 0.987
complete, 0.963
comp, 0.092
acc, 0.055
prec, 0.695
recall, 0.581
fscore, 0.631
The F-Score is 7% lower than the reported performance.
It seems the training occasionally gets stuck on some GPUs or some machines because of sync_batch. Setting sync_batch=False can avoid this, but it may decrease the performance a little.
Alternatively, you can try changing the random seed.
Or you can resume the training from the last checkpoint saved before the hang and continue.
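For the last two suggestions, a minimal sketch, assuming the project is trained with PyTorch Lightning (as the "ddp strategy" wording suggests); the seed, device count, max_epochs and checkpoint path are hypothetical placeholders:

import pytorch_lightning as pl

def resume_after_hang(model: pl.LightningModule, ckpt_path: str, seed: int = 1234) -> None:
    # Re-seed everything (dataloader workers included) to change the order in
    # which batches reach the collectives.
    pl.seed_everything(seed, workers=True)
    trainer = pl.Trainer(accelerator='gpu', devices=4, strategy='ddp', max_epochs=50)
    # ckpt_path restores the weights, optimizer state and epoch counter, so
    # training continues from the last checkpoint saved before the hang.
    trainer.fit(model, ckpt_path=ckpt_path)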
I tried setting sync_batch=False; it helps to avoid the hang, but it indeed decreases the performance significantly.
And with sync_batch=True, the training gets stuck at very early epochs (I got at most 5 consecutive epochs without a hang), so resuming every time seems a doubtful solution. But it may work, thanks.
Have you fixed it now?