mir-group/allegro

Training gets slower with increasing epochs

zjgemi opened this issue · 3 comments

zjgemi commented

almgcu.yaml.txt

When I trained the model on my own data (input config attached above, modified from example.yaml), I found that the training process becomes slower and slower as the number of epochs increases. The results in metrics_epoch.csv are as follows:

[Screenshot of metrics_epoch.csv results, 2023-10-17]

Here, we can see that the first epoch took only ~1h, while the 8th epoch took ~8h. If I interrupt the training process and restart it, the first epoch after restarting takes about 1h again, but it gets slower and slower afterward. I'm not sure if I'm doing something wrong that's causing this strange behavior.
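A quick way to confirm this trend is to compute per-epoch durations directly from metrics_epoch.csv. The minimal sketch below assumes the file has an `epoch` column and a cumulative wall-clock column (here guessed as `wall`, in seconds); adjust the column names to match your actual header.

```python
# Minimal sketch for inspecting per-epoch training time from metrics_epoch.csv.
# Assumes an "epoch" column and a cumulative "wall" column in seconds; the
# column names are an assumption and may differ in your file.
import pandas as pd

metrics = pd.read_csv("metrics_epoch.csv")
metrics.columns = [c.strip() for c in metrics.columns]  # headers may carry stray spaces

# Per-epoch duration = successive differences of the cumulative wall-clock column.
metrics["epoch_hours"] = metrics["wall"].diff().fillna(metrics["wall"]) / 3600.0
print(metrics[["epoch", "epoch_hours"]].to_string(index=False))
```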

Hi @zjgemi ,

Thanks for your interest in our code!

This is a known issue related to the underlying PyTorch: mir-group/nequip#311

Could you please report your PyTorch and CUDA versions, and try the latest nequip develop branch, which may have a fix?
After checking that, I suggest you use one of the recommended PyTorch versions from the linked issue (mir-group/nequip#311) if possible.
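For reference, one way to gather the requested version information is a short Python snippet using the standard PyTorch API:

```python
# Report the PyTorch build, the CUDA version it was compiled against, and the GPU in use.
import torch

print("torch:", torch.__version__)          # e.g. "2.0.0+cu117"
print("CUDA (build):", torch.version.cuda)  # CUDA version PyTorch was built against
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```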

zjgemi commented

My previous version was torch 2.0.0 + CUDA 11.7. After switching to torch 1.11.0 + CUDA 11.3, training is significantly faster and no longer slows down as the number of epochs increases. Thank you!

Glad to hear @zjgemi . Were you able to test whether the latest develop branch still exhibits the issue with PyTorch 2.0 on your system?