ethz-asl/s2loc

Cuda out of memory issue

Closed this issue · 1 comment

Hello! I'm a student interested in this field, and I'm trying out your code.
However, I run into an issue before training even starts on the sample data.
The output is shown below.

```
Total size: 49
Training size: 44
Validation size: 5
Test size is 0. Configured for external tests
Starting training using 55 epochs
0%| | 0/55 [00:00<?, ?it/s]load 0.pkl.gz... done
load 0.pkl.gz... done
load 0.pkl.gz... done
load 9.pkl.gz... done
load 4.pkl.gz... done
load 10.pkl.gz... done
load 11.pkl.gz... done
load 5.pkl.gz... done
load 6.pkl.gz... done
load 12.pkl.gz... done
load 6.pkl.gz... done
load 13.pkl.gz... done
load 14.pkl.gz... done
load 3.pkl.gz... done
load 8.pkl.gz... done
0%| | 0/55 [00:03<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 288, in
train_iter = train(net, criterion, optimizer, writer, epoch, train_iter, loss_, t0)
File "train.py", line 137, in train
embedded_a, embedded_p, embedded_n = net(data1, data2, data3)
File "/home/jmw0611/.conda/envs/deep/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jmw0611/Research/Deep/s2loc/script/model.py", line 94, in forward
x3 = self.convolutional(x3) # [batch, feature, beta, alpha, gamma]
File "/home/jmw0611/.conda/envs/deep/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jmw0611/.conda/envs/deep/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/home/jmw0611/.conda/envs/deep/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jmw0611/.conda/envs/deep/lib/python3.8/site-packages/s2cnn-1.0.0-py3.8.egg/s2cnn/soft/s2_conv.py", line 43, in forward
File "/home/jmw0611/.conda/envs/deep/lib/python3.8/site-packages/s2cnn-1.0.0-py3.8.egg/s2cnn/soft/so3_fft.py", line 465, in forward
File "/home/jmw0611/.conda/envs/deep/lib/python3.8/site-packages/s2cnn-1.0.0-py3.8.egg/s2cnn/soft/so3_fft.py", line 197, in so3_rifft
RuntimeError: CUDA out of memory. Tried to allocate 992.00 MiB (GPU 0; 9.76 GiB total capacity; 6.15 GiB already allocated; 239.38 MiB free; 7.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
INFO - 2022-08-18 19:35:00,695 - core - signal_shutdown [atexit]
```

Can you tell me how to overcome this issue?
Thank you!

LBern commented

Well, basically, you ran out of GPU memory. You can:

  • Decrease the batch size
  • Make the model smaller (smaller bandwidth, fewer features per layer, and/or fewer layers)
  • Get a bigger GPU ;)

See the sketch after this list for the first two options.
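
Here is a minimal sketch of those knobs. The dataset and the `Model(bandwidth=..., features=...)` constructor are hypothetical placeholders, not s2loc's actual API; check `train.py` and `script/model.py` for the real parameters. The `PYTORCH_CUDA_ALLOC_CONF` setting is the fragmentation workaround that the error message itself suggests, and it only helps when reserved memory is much larger than allocated memory.

```python
import os

# Fragmentation workaround quoted in the OOM message; it must be set before
# the first CUDA allocation, i.e. before any tensor touches the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the spherical training samples.
train_set = TensorDataset(torch.randn(44, 2, 200, 200))

# 1) Smaller batches: activation memory scales roughly linearly with batch
#    size, so halve it until the OOM disappears.
train_loader = DataLoader(train_set, batch_size=4, shuffle=True)

# 2) Smaller model: the S2/SO(3) FFTs in the traceback grow steeply with the
#    spherical bandwidth, so reducing the bandwidth (and/or the per-layer
#    feature counts) saves the most memory. `bandwidth` and `features` are
#    illustrative constructor arguments -- see script/model.py for the
#    actual ones.
# net = Model(bandwidth=50, features=[2, 20, 40, 80]).cuda()
```

Halving the batch size is the cheapest first experiment because it needs no model changes; shrinking the bandwidth or feature counts changes the architecture and may affect accuracy.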