Possibility of training on 1 GPU?
dogydev opened this issue · 10 comments
Running the training code causes repeated CUDA out-of-memory errors starting at around epoch 34.
My GPU: NVIDIA GTX 100Ti.
I've tried offloading the GCN to the CPU and setting the batch size to 1.
Is there any way I could further optimize my code to prevent these errors?
Thanks.
I solved the error by decreasing the batch size and max sequence length, and got an F1 score of around 20-30. Is there any way I can improve this without needing more resources?
Thanks again.
Can you help by sharing the code for it?
Hi @dogydev, maybe you can decrease the learning rate or use gradient accumulation.
OK, I will release the code on a fork. What should the learning rate and number of gradient accumulation steps be with a batch size of 1?
Thanks
Hi @dogydev, I am not sure about the learning rate, but gradient accumulation is an alternative way to accumulate gradients across batches, which aims to increase the effective batch size. So the setting depends on your target batch_size.
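For reference, here is a minimal, self-contained sketch of how gradient accumulation usually looks in PyTorch. This is illustrative only, not the CogQA training loop; the toy model, data, and loss are placeholders for the real ones.

```python
# Minimal gradient-accumulation sketch (illustrative, not the CogQA code).
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
loss_fn = nn.MSELoss()
accumulation_steps = 3                         # effective batch = per-step batch * 3

# Toy batches of size 4, standing in for a real DataLoader.
data = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(9)]

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y)
    # Scale the loss so the summed gradients match the average over the larger batch.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                       # one update per `accumulation_steps` batches
        optimizer.zero_grad()
```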
OK, thanks. I modified the parameters to what I think is optimal and will update with results after the model finishes training.
References:
https://stackoverflow.com/questions/53331540/accumulating-gradients
https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/20?u=alband
Parameters:
learning_rates: 1e-5
batch_size: 4
gradient_accumulation_steps: 3
epochs: 1
alpha: 0.2
mode: bundle
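(Assuming gradient accumulation multiplies the per-step batch, these settings give an effective batch size of batch_size × gradient_accumulation_steps = 4 × 3 = 12.)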
Could you release a fine-tuned model, by any chance?
Hi @dogydev, I am sorry, but I have been working on another project recently. We are planning to release a more flexible version of CogQA for all kinds of data, maybe in a few months.
Ok, thank you.