How to solve the problem of experiments stalling?
YUjh0729 opened this issue · 3 comments
YUjh0729 commented
Hello,
When I train the model, training stalls at a certain epoch and never continues. GPU usage sits at 1% while memory usage stays at 12GB, so the process still appears to be running, but it remains stuck at the same epoch for an entire night. What could be causing this? Can you help explain it?
Thank you.
zcyrique commented