Experienced system hanging?

Question

juliuswang0728 opened this issue 3 years ago · 1 comments

Hi!

This is more a question about whether others have had a similar experience than an actual issue report.

I've been experiencing system hangs (not sure whether the cause is the GPU, the dataloader, or something else) while fine-tuning a pre-trained model on, e.g., NLVR2.
It usually goes one of two ways:
(1) it hangs at the very first iteration of the first epoch and never proceeds;
(2) it hangs at iteration n, where n is some multiple of the number of workers set in the starting script, and never proceeds.

When it hangs, CPU and GPU utilization drop to zero and the system seems to be doing nothing.
Have you had a similar experience? If so, any advice on how to work around it?
Thanks!

Answer 1 · 2021-11-26T20:32:45.000Z

Hi Julius,

I have never experienced this with VOLTA.

I did run into it with another repository, though, and there the hanging got better as training went on.
I'm not sure what causes it.
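
Since the cause is unclear (GPU, dataloader, or something else), one generic way to narrow it down is to have Python dump the stack of every thread whenever an iteration takes suspiciously long. The following is only a minimal sketch using the standard-library `faulthandler` module; the helper name `hang_watchdog` and the timeout value are assumptions for illustration and are not part of VOLTA.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s=300.0, file=sys.stderr):
    """Hypothetical helper: dump all thread stacks of this process to `file`
    every `timeout_s` seconds for as long as the wrapped block is running.

    `file` must stay open and expose a real file descriptor (fileno()).
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)
    try:
        yield
    finally:
        # The block finished (or raised), so disarm the timer.
        faulthandler.cancel_dump_traceback_later()


# Usage sketch (the loop and train_step are placeholders, not VOLTA code):
#
# for step, batch in enumerate(train_loader):
#     with hang_watchdog(timeout_s=300.0):
#         train_step(batch)
```

If the dump shows the main process waiting inside the data-loading code, the workers are the likely suspect; if it is waiting inside a GPU call, the problem is more likely on the device side. Note that this only reports the process that armed the watchdog, so worker processes would need to be inspected separately.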