miyyer/scpn

Questions about training time


I am trying to train the SCPN model, but the training data is very large. I am using one GPU with a batch size of 64, and each batch takes 1.6 seconds to train, but there are 439,586 batches. I tried training with two GPUs but failed. Could you tell me how you sped up the training process? Thank you so much. @miyyer @jwieting

Hey, the time per batch looks high to me. What kind of GPU are you using?

@miyyer How long did it take you to train the SCPN model on the PARANMT-50M dataset (15 GB)?

For me, it has taken two days and only about half of the batches in epoch 0 have been trained.

done with batch 402000 / 439586 in epoch 0, loss: 1.020283, time:308 -- stdout after running for 3 days
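The numbers in this thread work out to very long epochs either way. A rough back-of-the-envelope sketch (assuming a constant per-batch time, which is only an approximation):

```python
# Estimate wall-clock time per epoch from the batch counts reported above.
# Assumption: per-batch time stays constant across the epoch.
BATCHES_PER_EPOCH = 439_586

def epoch_hours(sec_per_batch, batches=BATCHES_PER_EPOCH):
    """Hours of wall-clock time for one epoch at a fixed per-batch time."""
    return batches * sec_per_batch / 3600

# Original poster: 1.6 s/batch on a single GPU
print(f"{epoch_hours(1.6):.1f} h")  # ~195 h, i.e. roughly 8 days per epoch

# The K80 run above: 402,000 batches in ~3 days -> ~0.64 s/batch
k80_sec_per_batch = 3 * 24 * 3600 / 402_000
print(f"{epoch_hours(k80_sec_per_batch):.1f} h")  # ~79 h, i.e. a bit over 3 days
```

So at 1.6 s/batch a single epoch alone takes about a week, which is why the per-batch time (rather than the dataset size) is the first thing to question.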

Below is my device setup (nvidia-smi output):

+-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 000075E1:00:00.0 Off |                    0 |
| N/A   67C    P0   121W / 149W |  10586MiB / 11441MiB |     78%      Default |
+-------------------------------+----------------------+----------------------+

Thanks.