training very slow on GPU
jiayeguo opened this issue · 4 comments
Hi, I am trying to reproduce your results in Alanine_dipeptide_multiple_files on a single NVIDIA GeForce GTX 1080 Ti GPU, and it took ~5 h to finish all 10 attempts. I was using tensorflow-gpu v1.9.0, cuda/9.0, and cudnn/7.0. For comparison, I also ran the Jupyter notebook on my laptop CPU, and it was faster than the GPU (~3 h, but still very slow!). In the Nature Comm. paper, you mentioned that, depending on the system, each run takes between 20 s and 180 s. Since I didn't change the code, I am wondering why there is such a big discrepancy in speed compared to the paper. Do you have any insight into why my training is so slow? Thanks!
Hi,
the reason for the slow speed is that this notebook does not load the data into memory before training. Instead, each batch is loaded from the hard drive. This is meant to simulate the situation where the whole dataset does not fit into memory. However, reading from the hard drive is slow, and if you are using the GPU, each batch also has to be transferred to it, which I guess is why it is even slower on your desktop. The time is consumed by loading and transferring data.
For the paper we simply used only one trajectory and loaded it into memory before training (see the notebook without multiple files).
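To make the difference concrete, here is a minimal sketch of the two loading strategies described above. It is not the code from the notebooks: the file layout, the synthetic data, and the helper names (`write_demo_files`, `batches_from_disk`, `load_all`, `batches_from_memory`) are all assumptions made for illustration, and the "training step" is just a placeholder sum.

```python
import tempfile
import time
from pathlib import Path

import numpy as np


def write_demo_files(directory, n_files=4, n_frames=1000, n_features=30, seed=0):
    """Write synthetic trajectory files (hypothetical stand-ins for real data)."""
    rng = np.random.default_rng(seed)
    paths = []
    for i in range(n_files):
        path = Path(directory) / f"traj_{i}.npy"
        np.save(path, rng.normal(size=(n_frames, n_features)).astype(np.float32))
        paths.append(path)
    return paths


def batches_from_disk(paths, batch_size):
    """Reopen the file for every batch, as if the dataset did not fit in memory."""
    for path in paths:
        n_frames = np.load(path, mmap_mode="r").shape[0]
        for start in range(0, n_frames, batch_size):
            mapped = np.load(path, mmap_mode="r")  # file access on every batch
            yield np.array(mapped[start:start + batch_size])  # copy slice into RAM


def load_all(paths):
    """Read every trajectory from disk once, before training starts."""
    return [np.load(path) for path in paths]


def batches_from_memory(trajectories, batch_size):
    """Slice batches out of arrays that already live in memory."""
    for traj in trajectories:
        for start in range(0, len(traj), batch_size):
            yield traj[start:start + batch_size]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = write_demo_files(tmp)

        t0 = time.perf_counter()
        for _ in range(5):  # five "epochs"
            for batch in batches_from_disk(paths, 100):
                batch.sum()  # placeholder for a training step
        t_disk = time.perf_counter() - t0

        t0 = time.perf_counter()
        trajectories = load_all(paths)  # one-time cost
        for _ in range(5):
            for batch in batches_from_memory(trajectories, 100):
                batch.sum()
        t_mem = time.perf_counter() - t0

        print(f"per-batch disk loading: {t_disk:.3f} s, in-memory: {t_mem:.3f} s")
```

Both generators yield the same batches in the same order; only where the data comes from differs. How large the gap is depends on the drive, the file format, and operating-system caching, so the timings from this toy example should not be compared with the numbers in this thread.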
Anyhow, a colleague of mine is developing a new library with an implementation of VAMPnets in PyTorch, which will be more up to date. I will post a link here as soon as it is released.
I hope this answers your question!
Best
Andreas
Thanks for the clarification! That makes sense. Looking forward to trying out the PyTorch version.
Best,
Jiaye
Hi Jiaye,
colleague developing the new library here. Coincidentally, it is also called deeptime. If you are feeling adventurous and want to play around with it, you can find it here: https://github.com/deeptime-ml/deeptime (and documentation for vampnets in the new deeptime)
I have set up a small notebook for you demonstrating how you can use it to train vampnets. Training takes between 2:00 and 2:30 min on my machine for 60 epochs. There are two training routines: the 2:30 min one is more top-level and easier to implement, while the 2:00 min one is more optimized for data that can be held in memory in its entirety.
Cheers,
Moritz
Hi Moritz! Thanks for pointing me to this new repo. I will take a look and play around with it.
Best,
Jiaye