Experienced system hanging?
juliuswang0728 opened this issue · 1 comments
This is more a question about whether anyone has seen something similar than a real bug report.
I've been experiencing hangs (I can't tell whether they come from the GPU, the dataloader, or something else) while fine-tuning a pre-trained model on, e.g., NLVR2.
It usually goes one of two ways:
(1) It hangs at the first iteration of the first epoch and never proceeds.
(2) It hangs at iteration n, where n is some multiple of the number of workers set in the starting script, and never proceeds.
When it hangs, CPU and GPU utilization drop to zero and the system seems to be doing nothing.
Have you had a similar experience? If so, any advice on how to work around it?
Hi Julius,
I have never experienced this with VOLTA.
I did have it with another repository I used, though, and there the hanging got better as training went on.
I'm not sure what might cause it, though.
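Since the cause is unclear, one generic way to narrow it down is to have Python dump every thread's stack when a step takes too long. Below is a minimal sketch using only the standard-library `faulthandler` module. The helper names `arm_hang_watchdog` and `disarm_hang_watchdog` and the 600-second default are hypothetical; they are not part of VOLTA or of any training script mentioned here.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=600.0, log_file=sys.stderr):
    """(Re)start a timer that dumps all thread stacks if it is not re-armed in time.

    Hypothetical helper: call it once before training and again after every
    iteration. Each call replaces the previous timer, so a traceback is only
    written when a single iteration takes longer than timeout_s seconds.
    log_file must be a real file object (it needs a file descriptor).
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)


def disarm_hang_watchdog():
    """Cancel the pending dump, e.g. once training has finished."""
    faulthandler.cancel_dump_traceback_later()


# Sketch of use inside a training loop (train_step and loader are placeholders):
#
#     arm_hang_watchdog(600.0)
#     for batch in loader:
#         train_step(batch)
#         arm_hang_watchdog(600.0)   # progress was made, reset the timer
#     disarm_hang_watchdog()
```

If the dumped traceback shows the main thread waiting inside the data-loading code, that would point towards the worker processes rather than the GPU. Assuming the starting script exposes the number of workers, temporarily setting it to zero is another quick way to check whether the workers are involved.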