Training gets slower with increasing epochs
zjgemi opened this issue · 3 comments
When I trained the model on my own data (input config posted above, modified from example.yaml), I found that training becomes slower and slower as the number of epochs increases. The results in metrics_epoch.csv are as follows:
Here, we can see that the first epoch took only ~1h, while the 8th epoch took ~8h. If I interrupt the training process and restart it, the first epoch after restarting takes about 1h again, but it gets slower and slower afterward. I'm not sure if I'm doing something wrong that's causing this strange behavior.
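For reference, a minimal sketch of how the per-epoch timings can be read back from metrics_epoch.csv; the column names `epoch` and `wall` (cumulative wall-clock seconds) are assumptions here and may need to be adjusted to match your file's header:

```python
# Minimal sketch: compute per-epoch wall time from metrics_epoch.csv.
# Assumes an "epoch" column and a cumulative "wall" column in seconds;
# rename these if your header differs.
import pandas as pd

df = pd.read_csv("metrics_epoch.csv")
df.columns = df.columns.str.strip()  # headers are sometimes padded with spaces

# If "wall" is cumulative, the per-epoch cost is the row-to-row difference.
per_epoch_seconds = df["wall"].diff().fillna(df["wall"])
for epoch, hours in zip(df["epoch"], per_epoch_seconds / 3600.0):
    print(f"epoch {int(epoch)}: {hours:.2f} h")
```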
Hi @zjgemi ,
Thanks for your interest in our code!
This is a known issue related to the underlying PyTorch: mir-group/nequip#311
Could you please report your PyTorch and CUDA versions, and try the latest nequip `develop` branch, which may have a fix?
After checking that, I suggest using one of the recommended PyTorch versions from the linked issue (mir-group/nequip#311) if possible.
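A quick way to collect the version information requested above (standard PyTorch calls, nothing nequip-specific):

```python
# Print the PyTorch / CUDA / cuDNN versions and the GPU in use.
import torch

print("PyTorch:", torch.__version__)            # e.g. "2.0.0+cu117"
print("CUDA (build):", torch.version.cuda)      # CUDA toolkit PyTorch was built against
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```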
My previous version was torch 2.0.0 + CUDA 11.7. When I switched to torch 1.11.0 + CUDA 11.3, the training speed increased significantly and no longer slowed down as the number of epochs increased. Thank you!
Glad to hear it, @zjgemi. Were you able to test whether the latest `develop` branch still exhibits the issue with PyTorch 2.0 on your system?