CUDA out of memory during testing on the S3DIS dataset (segmentation) with pointnet.yaml
SantiDiazC opened this issue · 0 comments
SantiDiazC commented
Hi, thanks for your work.
I am testing the library, so I ran training with pointnet.yaml on the S3DIS dataset (segmentation). Training went well for 100 epochs with batch_size=2
on an RTX 3080. However, when the testing stage started I got the following error:
[01/20 04:16:41 S3DIS]: Test [5]/[68] cloud
Test on 5-th cloud [20]/[72]]: 28%|████████████████████████████████████████████▍ | 20/72 [00:02<00:05, 9.00it/s]
Traceback (most recent call last):
File "examples/segmentation/main.py", line 745, in <module>
main(0, cfg)
File "examples/segmentation/main.py", line 308, in main
test_miou, test_macc, test_oa, test_ious, test_accs, _ = test(model, data_list, cfg)
File "/home/hri-david/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "examples/segmentation/main.py", line 598, in test
logits = model(data)
File "/home/hri-david/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hri-david/PycharmProjects/Pointnet/PointNeXt/examples/segmentation/../../openpoints/models/segmentation/base_seg.py", line 45, in forward
p, f = self.encoder.forward_seg_feat(data)
File "/home/hri-david/PycharmProjects/Pointnet/PointNeXt/examples/segmentation/../../openpoints/models/backbone/pointnet.py", line 170, in forward_seg_feat
trans = self.stn(x)
File "/home/hri-david/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hri-david/PycharmProjects/Pointnet/PointNeXt/examples/segmentation/../../openpoints/models/backbone/pointnet.py", line 36, in forward
x = F.relu(self.bn3(self.conv3(x)))
File "/home/hri-david/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hri-david/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 179, in forward
self.eps,
File "/home/hri-david/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/nn/functional.py", line 2283, in batch_norm
input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 876.00 MiB (GPU 0; 9.74 GiB total capacity; 1.28 GiB already allocated; 121.19 MiB free; 3.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
wandb: | 2.905 MB of 2.905 MB uploaded
wandb: Run history:
wandb: best_val ▁▂▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▇██████████████
wandb: global_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: lr ████████▇▇▇▇▇▆▆▆▆▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁
wandb: macc_when_best ▁▂▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇███████████████
wandb: oa_when_best ▁▁███████████████▆▆▇▇▇▇▇▇▇██████████████
wandb: train_loss █▅▅▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: train_macc ▁▃▄▄▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇███████████████
wandb: train_miou ▁▃▄▄▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇███████████████
wandb: val_macc ▃▃▇▄▅▆▃▄▇▆▅▆▄▄▇▆▆▇▆▇▅▁▇▁▆▄█▇▇▆▇▃▇▇▆▆▇▇▅▄
wandb: val_miou ▄▃▇▄▆▆▃▄▇▆▆▅▃▄▇▅▅▇▅▆▄▂▇▁▆▃█▇▆▅▆▃▇▆▅▆▇▆▅▃
wandb: val_oa ▆▅█▄▇▇▅▆▇▇▇▆▅▅▇▆▆▇▆▆▅▂▇▁▇▃█▇▇▅▇▃▇▇▆▆▇▇▆▃
wandb:
wandb: Run summary:
wandb: best_val 22.63091
wandb: global_step 100
wandb: lr 1e-05
wandb: macc_when_best 29.38019
wandb: oa_when_best 61.35135
wandb: train_loss 1.55627
wandb: train_macc 42.63173
wandb: train_miou 34.23775
wandb: val_macc 20.69266
wandb: val_miou 12.51122
wandb: val_oa 41.35226
wandb:
wandb: 🚀 View run s3dis-train-pointnet-ngpus1-20240119-195032-Y9EAMrwTdiBMMf9hkLf8 at: https://wandb.ai/dsdiazc/PointNeXt-S3DIS/runs/5cx3w4ln
wandb: ️⚡ View job at https://wandb.ai/dsdiazc/PointNeXt-S3DIS/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjEzMTk0MzY1NQ==/version_details/v0
wandb: Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 2 other file(s)
wandb: Find logs at: ./wandb/run-20240119_195033-5cx3w4ln/logs
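If it is relevant, the error message above suggests tuning max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. My understanding is that this would be set before the script starts, roughly like below (the 128 MiB value is just a guess on my part, not something from the repo):

```python
# Rough sketch of what the error message suggests: configure the CUDA caching
# allocator before torch initializes. The 128 MiB split size is an arbitrary
# value chosen only for illustration.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the environment variable is set
```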
Should I make some additional modifications to the yaml file so that testing also works on my hardware (RTX 3080)?
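For example, one workaround I am considering (just a sketch of the idea, not the actual test() code in examples/segmentation/main.py; `model` and `chunks` are placeholder names) is running the per-cloud inference in smaller chunks and emptying the CUDA cache between them:

```python
import torch

# Sketch only: `model` is the segmentation model, `chunks` an iterable of
# sub-cloud batches already prepared for the forward pass.
@torch.no_grad()
def predict_in_chunks(model, chunks):
    model.eval()
    logits = []
    for data in chunks:                   # one sub-cloud batch at a time
        logits.append(model(data).cpu())  # move predictions off the GPU immediately
        torch.cuda.empty_cache()          # release cached blocks between chunks
    return torch.cat(logits, dim=0)
```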
Thank you in advance!