mir-group/allegro

Training gets slower with increasing epochs

zjgemi opened this issue · 3 comments

zjgemi commented

almgcu.yaml.txt

When I trained the model on my own data (input config attached above, modified from example.yaml), I found that the training process becomes slower and slower as the number of epochs increases. The results in metrics_epoch.csv are as follows:

[Screenshot of metrics_epoch.csv results, 2023-10-17]

Here, we can see that the first epoch took only ~1h, while the 8th epoch took ~8h. If I interrupt the training process and restart it, the first epoch after restarting takes about 1h again, but it gets slower and slower afterward. I'm not sure if I'm doing something wrong that's causing this strange behavior.
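A quick way to confirm this trend is to compute per-epoch durations directly from metrics_epoch.csv. The minimal sketch below assumes the file has an `epoch` column and a cumulative wall-clock column (here guessed as `wall`, in seconds); adjust the column names to match your actual header.

```python
# Minimal sketch for inspecting per-epoch training time from metrics_epoch.csv.
# Assumes an "epoch" column and a cumulative "wall" column in seconds; the
# column names are an assumption and may differ in your file.
import pandas as pd

metrics = pd.read_csv("metrics_epoch.csv")
metrics.columns = [c.strip() for c in metrics.columns]  # headers may carry stray spaces

# Per-epoch duration = successive differences of the cumulative wall-clock column.
metrics["epoch_hours"] = metrics["wall"].diff().fillna(metrics["wall"]) / 3600.0
print(metrics[["epoch", "epoch_hours"]].to_string(index=False))
```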

Hi @zjgemi ,

Thanks for your interest in our code!

This is a known issue related to the underlying PyTorch: mir-group/nequip#311

Could you please report your PyTorch and CUDA versions, and try the latest nequip develop branch, which may have a fix?
After checking that, I suggest you use one of the recommended PyTorch versions from the linked issue (mir-group/nequip#311) if possible.
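For reference, one way to gather the requested version information is a short Python snippet using the standard PyTorch API:

```python
# Report the PyTorch build, the CUDA version it was compiled against, and the GPU in use.
import torch

print("torch:", torch.__version__)          # e.g. "2.0.0+cu117"
print("CUDA (build):", torch.version.cuda)  # CUDA version PyTorch was built against
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```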

zjgemi commented

My previous version was torch 2.0.0 + CUDA 11.7. After switching to torch 1.11.0 + CUDA 11.3, training is significantly faster and no longer slows down as the number of epochs increases. Thank you!

Glad to hear @zjgemi . Were you able to test whether the latest develop branch still exhibits the issue with PyTorch 2.0 on your system?