guochengqian/PointNeXt

CUDA out of memory when testing on the S3DIS dataset (segmentation) using pointnet.yaml


Hi, thanks for your work.

I am testing the library, so I ran training with pointnet.yaml on the S3DIS dataset (segmentation). Training went well for 100 epochs with batch_size=2 on an RTX 3080. However, when the testing stage started, I ran into the following error:

[01/20 04:16:41 S3DIS]: Test [5]/[68] cloud
Test on 5-th cloud [20]/[72]]:  28%|████████████████████████████████████████████▍                                                                                                                   | 20/72 [00:02<00:05,  9.00it/s]
Traceback (most recent call last):
  File "examples/segmentation/main.py", line 745, in <module>
    main(0, cfg)
  File "examples/segmentation/main.py", line 308, in main
    test_miou, test_macc, test_oa, test_ious, test_accs, _ = test(model, data_list, cfg)
  File "/home/hri-david/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "examples/segmentation/main.py", line 598, in test
    logits = model(data)
  File "/home/hri-david/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hri-david/PycharmProjects/Pointnet/PointNeXt/examples/segmentation/../../openpoints/models/segmentation/base_seg.py", line 45, in forward
    p, f = self.encoder.forward_seg_feat(data)
  File "/home/hri-david/PycharmProjects/Pointnet/PointNeXt/examples/segmentation/../../openpoints/models/backbone/pointnet.py", line 170, in forward_seg_feat
    trans = self.stn(x)
  File "/home/hri-david/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hri-david/PycharmProjects/Pointnet/PointNeXt/examples/segmentation/../../openpoints/models/backbone/pointnet.py", line 36, in forward
    x = F.relu(self.bn3(self.conv3(x)))
  File "/home/hri-david/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hri-david/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 179, in forward
    self.eps,
  File "/home/hri-david/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/nn/functional.py", line 2283, in batch_norm
    input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 876.00 MiB (GPU 0; 9.74 GiB total capacity; 1.28 GiB already allocated; 121.19 MiB free; 3.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
wandb: | 2.905 MB of 2.905 MB uploaded
wandb: Run history:
wandb:       best_val ▁▂▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▇██████████████
wandb:    global_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:             lr ████████▇▇▇▇▇▆▆▆▆▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁
wandb: macc_when_best ▁▂▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇███████████████
wandb:   oa_when_best ▁▁███████████████▆▆▇▇▇▇▇▇▇██████████████
wandb:     train_loss █▅▅▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:     train_macc ▁▃▄▄▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇███████████████
wandb:     train_miou ▁▃▄▄▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇███████████████
wandb:       val_macc ▃▃▇▄▅▆▃▄▇▆▅▆▄▄▇▆▆▇▆▇▅▁▇▁▆▄█▇▇▆▇▃▇▇▆▆▇▇▅▄
wandb:       val_miou ▄▃▇▄▆▆▃▄▇▆▆▅▃▄▇▅▅▇▅▆▄▂▇▁▆▃█▇▆▅▆▃▇▆▅▆▇▆▅▃
wandb:         val_oa ▆▅█▄▇▇▅▆▇▇▇▆▅▅▇▆▆▇▆▆▅▂▇▁▇▃█▇▇▅▇▃▇▇▆▆▇▇▆▃
wandb: 
wandb: Run summary:
wandb:       best_val 22.63091
wandb:    global_step 100
wandb:             lr 1e-05
wandb: macc_when_best 29.38019
wandb:   oa_when_best 61.35135
wandb:     train_loss 1.55627
wandb:     train_macc 42.63173
wandb:     train_miou 34.23775
wandb:       val_macc 20.69266
wandb:       val_miou 12.51122
wandb:         val_oa 41.35226
wandb: 
wandb: 🚀 View run s3dis-train-pointnet-ngpus1-20240119-195032-Y9EAMrwTdiBMMf9hkLf8 at: https://wandb.ai/dsdiazc/PointNeXt-S3DIS/runs/5cx3w4ln
wandb: ️⚡ View job at https://wandb.ai/dsdiazc/PointNeXt-S3DIS/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjEzMTk0MzY1NQ==/version_details/v0
wandb: Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 2 other file(s)
wandb: Find logs at: ./wandb/run-20240119_195033-5cx3w4ln/logs

Should I make additional modifications to the yaml file so that testing works on my hardware (RTX 3080)?
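
For reference, this is roughly what I was planning to try next, following the hint in the OOM message about the allocator (the config path, the mode=test override, and the --pretrained_path flag are written from memory following the README's usual command pattern, so they may need adjusting; the checkpoint path is a placeholder):

# Rerun only the test stage with the allocator setting suggested by the error message
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

CUDA_VISIBLE_DEVICES=0 python examples/segmentation/main.py \
  --cfg cfgs/s3dis/pointnet.yaml \
  mode=test \
  --pretrained_path log/s3dis/<run_dir>/checkpoint/<best_checkpoint>.pth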

Thank you in advance!