jnhwkim/ban-vqa

Out of memory while executing loss.backward()

cengzy14 opened this issue · 6 comments

Hello, thanks for your great code! I'm having some trouble running

python3 main.py --use_both True --use_vg True

I have 4 TITAN Xps, each with 12.2G of memory, and set the batch size to 256. Then I get the following error:

nParams= 90618566
optim: adamax lr=0.0007, decay_step=2, decay_rate=0.25, grad_clip=0.25
gradual warmup lr: 0.0003
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
File "main.py", line 97, in <module>
train(model, train_loader, eval_loader, args.epochs, args.output, optim, epoch)
File "/home/Project/ban-vqa/train.py", line 74, in train
loss.backward()
File "/home/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/autograd/variable.py", line 167, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File "/home/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/autograd/__init__.py", line 99, in backward
variables, grad_variables, retain_graph)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58

If I set the batch size to 128, it occupies ~12G of GPU memory during the early stage and then goes down to ~6G per GPU. Is there something wrong with how I'm running it?
Thx!

Did you update the NVIDIA drivers to the latest version? Some versions seem to use GPU memory inefficiently. Usually it should work with 4 Titan Xs and a batch size of 256.

Thanks for your timely help!
I will update it to 410.78. Btw, what's the version of the NVIDIA drivers on your machine? My machines' versions are 384.90 and 390.67, but neither worked for me.

@jaesuny, could you share the details of your environment from your recent reproduction, for this issue? Or any thoughts?

Adding del loss, pred, att and torch.cuda.empty_cache() after every iteration seems to work for me. It only occupies ~7G of GPU memory while training. However, the training speed slows down to 10000s/epoch, so I'm still looking for a better solution. Thx!
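
For readers who want to try the same workaround, here is a minimal sketch of where those calls would sit in a training loop. It is not the actual train.py from this repo: the tiny model, the random batches, and the names device, batches, and criterion are hypothetical stand-ins, and it assumes a PyTorch version that provides torch.cuda.empty_cache().

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real model and data loader in train.py.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(16, 4).to(device)
optim = torch.optim.Adamax(model.parameters(), lr=0.0007)
criterion = nn.BCEWithLogitsLoss()
batches = [(torch.randn(8, 16), torch.rand(8, 4)) for _ in range(3)]

steps = 0
for v, a in batches:
    v, a = v.to(device), a.to(device)
    pred = model(v)
    loss = criterion(pred, a)

    optim.zero_grad()
    loss.backward()
    optim.step()
    steps += 1

    # Drop the Python references so the tensors (and any graph they
    # still hold) become eligible for freeing before the next batch.
    del loss, pred
    if torch.cuda.is_available():
        # Hands cached, unused blocks back to the driver. This lowers the
        # memory reported by nvidia-smi, at the cost of extra time per step.
        torch.cuda.empty_cache()
```

Calling empty_cache() less often (say, every few hundred iterations) may be a middle ground between memory and speed, though that is untested here.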

It seems that torch.cuda.empty_cache() takes most of the time. I changed line 81 of train.py to total_loss += loss.data[0] * v.data.size(0) and removed torch.cuda.empty_cache(). Now it takes ~7450s/epoch (still quite slow...) and ~12G of GPU memory. Maybe it's because of PyTorch's inefficient GPU memory usage.
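
A minimal sketch of the accumulation pattern described above, assuming the goal is to keep the running loss as a plain Python number rather than a tensor. The model and batches are hypothetical stand-ins, not this repo's code. On PyTorch 0.4 and later, loss.item() is the equivalent of the loss.data[0] used in the comment.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real model and data loader.
model = nn.Linear(16, 4)
criterion = nn.BCEWithLogitsLoss()
batches = [(torch.randn(8, 16), torch.rand(8, 4)) for _ in range(3)]

total_loss = 0.0
for v, a in batches:
    loss = criterion(model(v), a)
    loss.backward()

    # .item() copies the scalar out as a Python float, so total_loss
    # holds no reference to the loss tensor or its autograd graph.
    total_loss += loss.item() * v.size(0)

print(type(total_loss).__name__)
```

Accumulating the tensor itself (total_loss += loss) would instead keep each iteration's graph reachable, which is a common cause of memory growing over an epoch.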

@cengzy14 thanks for the tip!