CUDA out of memory
Closed this issue · 3 comments
Hi, thanks for your great work.
I'm trying to run PointMLP on an RTX 3080 Ti with 12 GB of VRAM. After each stage, the memory consumption of PointMLP increases a lot. I logged the tensor sizes and memory used at each stage; this is what I got:
Stage n°1
Input points...................: [32, 1024, 3]
Input features.................: [32, 64, 1024]
CUDA memory allocated..........: 67,613,184
After geometric affine points..: [32, 512, 3]
After geometric affine features: [32, 512, 24, 128]
CUDA memory allocated..........: 578,729,472
After pre extraction points....: [32, 128, 512]
CUDA memory allocated..........: 2,617,166,336
After pos extraction features..: [32, 128, 512]
CUDA memory allocated..........: 2,684,279,296
------------------------------------------------------------
Stage n°2
Input points...................: [32, 512, 3]
Input features.................: [32, 128, 512]
CUDA memory allocated..........: 2,684,279,296
After geometric affine points..: [32, 256, 3]
After geometric affine features: [32, 256, 24, 256]
CUDA memory allocated..........: 3,190,775,296
After pre extraction points....: [32, 256, 256]
CUDA memory allocated..........: 5,229,217,280
After pos extraction features..: [32, 256, 256]
CUDA memory allocated..........: 5,296,334,336
------------------------------------------------------------
Stage n°3
Input points...................: [32, 256, 3]
Input features.................: [32, 256, 256]
CUDA memory allocated..........: 5,296,334,336
After geometric affine points..: [32, 128, 3]
After geometric affine features: [32, 128, 24, 512]
CUDA memory allocated..........: 5,801,241,088
After pre extraction points....: [32, 512, 128]
CUDA memory allocated..........: 7,839,693,312
After pos extraction features..: [32, 512, 128]
CUDA memory allocated..........: 7,906,818,560
------------------------------------------------------------
Stage n°4
Input points...................: [32, 128, 3]
Input features.................: [32, 512, 128]
CUDA memory allocated..........: 7,906,818,560
After geometric affine points..: [32, 64, 3]
After geometric affine features: [32, 64, 24, 1024]
CUDA memory allocated..........: 8,410,930,688
Traceback (most recent call last):
File "original_respointMLP.py", line 389, in <module>
out = model(data)
File "/home/vedrenne/miniconda3/envs/dlgpucu113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "original_respointMLP.py", line 343, in forward
x = self.pre_blocks_list[i](x) # [b,d,g]
File "/home/vedrenne/miniconda3/envs/dlgpucu113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "original_respointMLP.py", line 252, in forward
x = self.operation(x) # [b, d, k]
File "/home/vedrenne/miniconda3/envs/dlgpucu113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/vedrenne/miniconda3/envs/dlgpucu113/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/home/vedrenne/miniconda3/envs/dlgpucu113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "original_respointMLP.py", line 224, in forward
return self.act(self.net2(self.net1(x)) + x)
RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 11.77 GiB total capacity; 9.71 GiB already allocated; 147.56 MiB free; 9.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Note that the code runs just fine on CPU, and that the Elite version works on GPU.
Is this the expected behavior? If so, how much VRAM is required with a batch size of 32? It also seems odd to have such high memory consumption for a rather small model; how would you explain that?
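(Aside for anyone hitting the same traceback: the OOM message itself suggests trying the allocator's max_split_size_mb option. A minimal way to set it, assuming the script name from the traceback; note this only mitigates fragmentation and cannot make a peak that exceeds the card's capacity fit:)

```shell
# Allocator hint quoted in the OOM message: cap the size (in MiB) of cached
# blocks that may be split, to reduce fragmentation of reserved memory.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
python original_respointMLP.py
```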
@ListIndexOutOfRange Thanks for your interest.
This is expected behavior; PointMLP does need a large amount of memory.
With the default setting, the peak memory will be ~19GB for pointMLP and ~4.5GB for pointMLP-elite. See the screenshots.
Note that the tensor shape in each stage is [batch_size, number_of_selected_points, number_of_neighbors, dimension] (e.g., [32, 64, 24, 512] in the last stage of PointMLP; see Fig. 6). Considering the FC operation, this is not small. The residual connections also increase the memory cost.
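As a rough sanity check of the numbers above, the per-stage grouped feature tensors from the logs can be sized by hand (shapes taken from the log, assuming float32, i.e., 4 bytes per element). Notably, each stage's grouped tensor has the same element count, and a single copy is exactly the 192.00 MiB that the failed allocation in the traceback asked for; the multi-GB totals come from the many such intermediate activations kept for the backward pass:

```python
# Grouped feature tensor at each stage, from the logs above:
# [batch, selected_points, neighbors, channels].
stage_shapes = [
    (32, 512, 24, 128),   # stage 1
    (32, 256, 24, 256),   # stage 2
    (32, 128, 24, 512),   # stage 3
    (32, 64, 24, 1024),   # stage 4
]

BYTES_PER_FLOAT32 = 4

for i, shape in enumerate(stage_shapes, start=1):
    n = 1
    for d in shape:
        n *= d
    mib = n * BYTES_PER_FLOAT32 / 2**20
    print(f"stage {i}: {n:,} elements = {mib:.2f} MiB per copy")
```

Each line prints 50,331,648 elements, i.e., 192.00 MiB per copy in float32.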
I will close this issue since there has been no further discussion. Feel free to reopen it if necessary.
Hi! I see your CUDA version is 11.4 when using the command nvidia-smi.
May I ask which version appears when using the following?
- Command: nvcc --version
- Inside python command: torch.version.cuda
My question is because I get the following every time I try to run 'pip3 install pointnet2_ops_lib/':
'The detected CUDA version (11.4) mismatches the version that was used to compile PyTorch (11.3). Please make sure to use the same CUDA versions.'
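For context, that error comes from a build-time check that compares the CUDA toolkit version detected via nvcc against the version PyTorch was compiled with. A minimal sketch of that kind of comparison (the function name and exact logic here are assumptions, not the actual setup code):

```python
def cuda_versions_match(detected: str, torch_built: str) -> bool:
    """Compare two CUDA version strings on their major.minor components,
    the way extension builds typically do."""
    to_pair = lambda v: tuple(int(x) for x in v.split(".")[:2])
    return to_pair(detected) == to_pair(torch_built)

# The situation from the error message: system toolkit is 11.4,
# but the installed PyTorch wheel was built against 11.3.
print(cuda_versions_match("11.4", "11.3"))  # prints False -> build refuses
```

The usual fix is to make the two agree: either install a PyTorch build compiled against your system's CUDA toolkit, or install the toolkit version PyTorch was built with (here 11.3) so that nvcc reports the matching version.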
Thank you!