Out of memory while executing loss.backward()
cengzy14 opened this issue · 6 comments
Hello, thanks for your great code! I ran into some trouble while running
python3 main.py --use_both True --use_vg True
I have 4 TITAN Xps with 12.2G of memory per GPU, and I set the batch size to 256. Then I get the following error:
nParams= 90618566
optim: adamax lr=0.0007, decay_step=2, decay_rate=0.25, grad_clip=0.25
gradual warmup lr: 0.0003
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
File "main.py", line 97, in
train(model, train_loader, eval_loader, args.epochs, args.output, optim, epoch)
File "/home/Project/ban-vqa/train.py", line 74, in train
loss.backward()
File "/home/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/autograd/variable.py", line 167, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File "/home/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/autograd/init.py", line 99, in backward
variables, grad_variables, retain_graph)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58
If I set the batch size to 128, it occupies ~12G of GPU memory per GPU during the early stage and then drops to ~6G per GPU. Is there something wrong with my execution?
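(For what it's worth, I'm not sure how much of the memory reported by nvidia-smi is actually held by tensors and how much is PyTorch's caching allocator holding freed blocks. On PyTorch >= 0.4 the two numbers can be compared directly; a minimal sketch, not something this repo ships:)

```python
# Minimal sketch (assumes PyTorch >= 0.4): compare memory actually held by
# tensors against memory cached by PyTorch's allocator on each GPU.
# nvidia-smi reports roughly the cached amount plus CUDA context overhead,
# so a large cache can look like heavy usage even when little is allocated.
import torch

for d in range(torch.cuda.device_count()):
    alloc_mb = torch.cuda.memory_allocated(d) / 1024 ** 2
    cached_mb = torch.cuda.memory_cached(d) / 1024 ** 2
    print('cuda:%d  allocated %.0f MB  cached %.0f MB' % (d, alloc_mb, cached_mb))
```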
Thx!
Did you update the NVIDIA driver to the latest version? Some versions seem to use GPU memory inefficiently. It should usually work with 4 Titan Xs and a batch size of 256.
Thanks for your timely help!
I will update it to 410.78. Btw, what's the NVIDIA driver version on your machine? Mine are 384.90 and 390.67, but neither worked for me.
@jaesuny, could you provide information about your environment from your recent reproduction of this issue? Or any thoughts?
Adding `del loss, pred, att` and `torch.cuda.empty_cache()` after every iteration seems to work for me. It only occupies ~7G of GPU memory while training. However, training slows down to ~10000 s/epoch, so I'm still looking for a better solution. Thx!
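Concretely, this is roughly where I put the two lines; the unpacking `(v, b, q, a)`, the forward call, and the loss function below are stand-ins for whatever train.py actually does, not an exact copy of it:

```python
# Rough placement of the workaround inside the training loop; names such as
# model, train_loader, optim and instance_bce_with_logits are assumed from
# the usual structure of train.py and may not match it exactly.
for i, (v, b, q, a) in enumerate(train_loader):
    pred, att = model(v, b, q, a)
    loss = instance_bce_with_logits(pred, a)
    loss.backward()                 # the OOM in the traceback was raised here
    optim.step()
    optim.zero_grad()

    # Workaround: drop the references that keep this batch's autograd graph
    # alive, then return the cached blocks to the driver. empty_cache() is
    # the expensive call and is what slows training to ~10000 s/epoch.
    del loss, pred, att
    torch.cuda.empty_cache()
```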
It seems that `torch.cuda.empty_cache()` takes most of the time. I changed line 81 of train.py to `total_loss += loss.data[0] * v.data.size(0)` and removed `torch.cuda.empty_cache()`. Now it takes ~7450 s/epoch (still quite slow...) and ~12G of GPU memory. Maybe it's due to PyTorch's inefficient GPU memory usage.
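To make the change concrete, here is a before/after sketch of that line; the "before" form is only my guess at what line 81 looked like originally:

```python
# Before (guessed original form of line 81): `loss` is still a Variable, so
# the running sum keeps a reference to this batch's entire autograd graph and
# its intermediate buffers cannot be freed.
total_loss += loss * v.size(0)

# After: loss.data[0] copies the scalar out as a plain Python float
# (loss.item() on PyTorch >= 0.4), so the graph is released as soon as the
# iteration ends.
total_loss += loss.data[0] * v.data.size(0)
```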