ma-xu/pointMLP-pytorch

CUDA out of memory

Closed this issue · 3 comments

Hi, thanks for your great work.

I'm trying to run PointMLP on an RTX 3080 Ti with 12 GB of VRAM. After each stage, the memory consumption of PointMLP increases a lot. I logged the tensor sizes and memory used at each stage; this is what I got:

Stage n°1
 Input points...................:  [32, 1024, 3]
 Input features.................:  [32, 64, 1024]
 CUDA memory allocated..........:  67,613,184
 After geometric affine points..:  [32, 512, 3]
 After geometric affine features:  [32, 512, 24, 128]
 CUDA memory allocated..........:  578,729,472
 After pre extraction points....:  [32, 128, 512]
 CUDA memory allocated..........:  2,617,166,336
 After pos extraction features..:  [32, 128, 512]
 CUDA memory allocated..........:  2,684,279,296
------------------------------------------------------------
Stage n°2
 Input points...................:  [32, 512, 3]
 Input features.................:  [32, 128, 512]
 CUDA memory allocated..........:  2,684,279,296
 After geometric affine points..:  [32, 256, 3]
 After geometric affine features:  [32, 256, 24, 256]
 CUDA memory allocated..........:  3,190,775,296
 After pre extraction points....:  [32, 256, 256]
 CUDA memory allocated..........:  5,229,217,280
 After pos extraction features..:  [32, 256, 256]
 CUDA memory allocated..........:  5,296,334,336
------------------------------------------------------------
Stage n°3
 Input points...................:  [32, 256, 3]
 Input features.................:  [32, 256, 256]
 CUDA memory allocated..........:  5,296,334,336
 After geometric affine points..:  [32, 128, 3]
 After geometric affine features:  [32, 128, 24, 512]
 CUDA memory allocated..........:  5,801,241,088
 After pre extraction points....:  [32, 512, 128]
 CUDA memory allocated..........:  7,839,693,312
 After pos extraction features..:  [32, 512, 128]
 CUDA memory allocated..........:  7,906,818,560
------------------------------------------------------------
Stage n°4
 Input points...................:  [32, 128, 3]
 Input features.................:  [32, 512, 128]
 CUDA memory allocated..........:  7,906,818,560
 After geometric affine points..:  [32, 64, 3]
 After geometric affine features:  [32, 64, 24, 1024]
 CUDA memory allocated..........:  8,410,930,688
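
For reference, per-stage lines like the ones above can be produced with a small helper around `torch.cuda.memory_allocated()`, which returns the bytes currently allocated by PyTorch on the device. A minimal sketch (`log_line` is a hypothetical helper, not part of the repository; on a CUDA device the byte count would come from `torch.cuda.memory_allocated()`):

```python
def log_line(label: str, value) -> str:
    """Format one log line in the style above: dot-padded label, then value."""
    if isinstance(value, int):
        value = f"{value:,}"          # thousands separators, as in the log
    return f" {label:.<31}:  {value}"

# Example usage inside forward(), after each stage:
print(log_line("Input points", [32, 1024, 3]))
print(log_line("CUDA memory allocated", 67613184))
```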

Traceback (most recent call last):
  File "original_respointMLP.py", line 389, in <module>
    out = model(data)
  File "/home/vedrenne/miniconda3/envs/dlgpucu113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "original_respointMLP.py", line 343, in forward
    x = self.pre_blocks_list[i](x)  # [b,d,g]
  File "/home/vedrenne/miniconda3/envs/dlgpucu113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "original_respointMLP.py", line 252, in forward
    x = self.operation(x)  # [b, d, k]
  File "/home/vedrenne/miniconda3/envs/dlgpucu113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/vedrenne/miniconda3/envs/dlgpucu113/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/vedrenne/miniconda3/envs/dlgpucu113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "original_respointMLP.py", line 224, in forward
    return self.act(self.net2(self.net1(x)) + x)
RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 11.77 GiB total capacity; 9.71 GiB already allocated; 147.56 MiB free; 9.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
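
The error message itself points at one mitigation for fragmentation. Setting the allocator hint before launching training looks like this (the 128 MB value is only an illustration, not a recommendation from the authors; it helps only when reserved memory greatly exceeds allocated memory):

```shell
# Allocator hint from the error message above; cap the size of cached
# memory blocks so large allocations are less likely to fail due to
# fragmentation. The value 128 is illustrative.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```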

Note that the code runs just fine on CPU, and that the Elite version works on GPU.
Is this the expected behavior? If so, how much VRAM is required with a batch size of 32? Also, it seems weird to have such high memory consumption for a rather tiny model; how would you explain that?

ma-xu commented

@ListIndexOutOfRange Thanks for your interest.

This is expected behavior: PointMLP does need a lot of memory.
With the default settings, peak memory is ~19 GB for PointMLP and ~4.5 GB for PointMLP-elite. See the screenshots.

[screenshot: PointMLP GPU memory usage]

[screenshot: PointMLP-elite GPU memory usage]

Note that the tensor shape in each stage is [batch_size, number_of_selected_points, number_of_neighbors, dimension] (e.g., [32, 64, 24, 512] in the last stage of PointMLP; see Fig. 6). Considering the FC operations, this is not small. The residual connections also increase the memory cost.
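
As a sanity check on these numbers: the failed 192 MiB allocation in the traceback matches exactly one float32 activation of the stage-4 shape [32, 64, 24, 1024] from the log above:

```python
# One float32 activation of shape [batch, points, neighbors, dim] from
# stage 4 of the log; 4 bytes per float32 element.
b, n, k, d = 32, 64, 24, 1024
nbytes = b * n * k * d * 4
print(nbytes)            # 201326592 bytes
print(nbytes / 2**20)    # 192.0 -> the "Tried to allocate 192.00 MiB"
```

And since the residual MLP blocks keep several such activations (plus gradients) alive at once, the multi-gigabyte totals in the log are plausible even though the parameter count is small.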

ma-xu commented

I will close this issue since there has been no further discussion. Feel free to reopen it if necessary.

Hi! I see your CUDA version is 11.4 when using the command `nvidia-smi`.

May I ask which version appears when using the following?

  1. Command: `nvcc --version`
  2. Inside Python: `torch.version.cuda`

I ask because I get the following error every time I try to run `pip3 install pointnet2_ops_lib/`:
`The detected CUDA version (11.4) mismatches the version that was used to compile PyTorch (11.3). Please make sure to use the same CUDA versions.`
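
For context, the extension build compares the system toolkit version (from `nvcc`) against the CUDA version PyTorch was compiled with (`torch.version.cuda`) and aborts on a mismatch. A minimal sketch of that comparison (`versions_match` is a hypothetical helper, not part of the build system, which compares only major.minor here):

```python
def versions_match(torch_cuda: str, toolkit_cuda: str) -> bool:
    # The build aborts when major.minor differ, as in the
    # 11.3-vs-11.4 error quoted above.
    return torch_cuda.split(".")[:2] == toolkit_cuda.split(".")[:2]

print(versions_match("11.3", "11.4"))  # the mismatch that triggers the error
print(versions_match("11.3", "11.3"))
```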

Thank you!