Possibility of training on 1 GPU?
dogydev opened this issue · 10 comments
Running the training code causes repeated CUDA out-of-memory errors starting at around epoch 34.
My GPU: NVIDIA GTX 100Ti.
I've tried offloading the GCN to the CPU and setting the batch size to 1.
Is there any way I could further optimize my code to prevent these errors?
Thanks.
I solved the error by decreasing the batch size and max sequence length, and got an F1 score of around 20-30. Is there any way I can improve this without needing more resources?
Thanks again.
Can you help by sharing the code for it?
Hi @dogydev, maybe you can decrease the learning rate or use gradient accumulation.
OK, I will release the code on a fork. What should the learning rate and number of gradient accumulation steps be with a batch size of 1?
Thanks
Hi @dogydev, I am not sure about the learning rate, but gradient accumulation is an alternative way to accumulate gradients across batches, which aims to increase the effective batch size. So the setting depends on your target batch_size.
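For reference, here is a minimal, self-contained sketch of how gradient accumulation usually looks in PyTorch. This is illustrative only, not the CogQA training loop; the toy model, data, and loss are placeholders for the real ones.

```python
# Minimal gradient-accumulation sketch (illustrative, not the CogQA code).
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
loss_fn = nn.MSELoss()
accumulation_steps = 3                         # effective batch = per-step batch * 3

# Toy batches of size 4, standing in for a real DataLoader.
data = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(9)]

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y)
    # Scale the loss so the summed gradients match the average over the larger batch.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                       # one update per `accumulation_steps` batches
        optimizer.zero_grad()
```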
OK, thanks. I modified the parameters to what I think is optimal and will update with results after the model finishes training.
References:
https://stackoverflow.com/questions/53331540/accumulating-gradients
https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/20?u=alband
Parameters:
learning_rates: 1e-5
batch_size: 4
gradient_accumulation_steps: 3
epochs: 1
alpha: 0.2
mode: bundle
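(Assuming gradient accumulation multiplies the per-step batch, these settings give an effective batch size of batch_size × gradient_accumulation_steps = 4 × 3 = 12.)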
Could you release a fine-tuned model, by any chance?
Hi @dogydev, I am sorry, but I have been working on another project recently. We are planning to release a more flexible version of CogQA for all kinds of data, maybe in a few months.
Ok, thank you.