hzxie/GRNet

Cannot train with gridding loss!


I replaced the Chamfer distance with the gridding loss (roughly as sketched after the traceback below). No matter how much GPU memory I have, training reports a CUDA out-of-memory error after 4 or 5 iterations of the first epoch:

Traceback (most recent call last):
  File "runner.py", line 76, in <module>
    main()
  File "runner.py", line 58, in main
    train_net(cfg)
  File "/root/autodl-tmp/code/GRNet/core/train.py", line 115, in train_net
    dense_loss = gridding_loss(dense_ptcloud, data['gtcloud'])
  File "/root/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/autodl-tmp/code/GRNet/extensions/gridding_loss/__init__.py", line 107, in forward
    pred_grid, gt_grid = gdist(pred_cloud, gt_cloud)
  File "/root/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/autodl-tmp/code/GRNet/extensions/gridding_loss/__init__.py", line 89, in forward
    return torch.cat(pred_grids, dim=0).contiguous(), torch.cat(gt_grids, dim=0).contiguous()
RuntimeError: CUDA out of memory. Tried to allocate 2.34 GiB (GPU 0; 11.91 GiB total capacity; 10.18 GiB already allocated; 106.94 MiB free; 11.21 GiB reserved in total by PyTorch)
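For reference, this is roughly how I made the swap (a sketch only; I am assuming GriddingLoss takes lists of scales and alphas as suggested by extensions/gridding_loss/__init__.py, and the values shown are illustrative, not the repo defaults):

```python
# Sketch of the loss swap in core/train.py. ChamferDistance and GriddingLoss
# come from the repo's extensions; the scale/alpha values here are illustrative.
from extensions.chamfer_dist import ChamferDistance
from extensions.gridding_loss import GriddingLoss

chamfer_dist = ChamferDistance()
# A smaller scale shrinks the dense scale**3 grids that GriddingLoss.forward
# concatenates (the torch.cat call in the traceback), which lowers peak memory.
gridding_loss = GriddingLoss(scales=[64], alphas=[0.1])

# Inside the training loop (dense_ptcloud and data come from the existing loop):
# dense_loss = chamfer_dist(dense_ptcloud, data['gtcloud'])  # original loss
dense_loss = gridding_loss(dense_ptcloud, data['gtcloud'])    # the call that runs out of memory
```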

I also found that with multi-GPU training the load is not balanced: the first GPU carries almost all of the memory load while the other GPUs are barely used, which is what causes the out-of-memory error on the first GPU. A possible workaround is sketched below.
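The imbalance looks like the usual nn.DataParallel pattern, where the model's outputs are gathered onto GPU 0 and the loss (here, the full batch of dense grids) is then built on GPU 0 alone. One possible, untested workaround is to replicate the loss module itself so each GPU grids only its slice of the batch; this is a sketch under that assumption, not something from the repo:

```python
import torch.nn as nn

from extensions.gridding_loss import GriddingLoss

# Hypothetical workaround, untested with GRNet's custom CUDA extension: wrap the
# loss in DataParallel so each GPU builds grids for its own chunk of the batch
# instead of GPU 0 building them for the whole gathered batch.
gridding_loss = nn.DataParallel(GriddingLoss(scales=[64], alphas=[0.1]).cuda())

# In the training loop, DataParallel scatters both point clouds along the batch
# dimension and returns one loss per replica; average them into a scalar:
# dense_loss = gridding_loss(dense_ptcloud, data['gtcloud']).mean()
```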

When I use the Chamfer distance, everything works fine.

[Screenshot: GPU usage]

My environment is Python 3.6.13, CUDA 10.1.234, PyTorch 1.6.0, and 4 TITAN Xp GPUs.

Hi, did you succeed in training with the gridding loss? I replaced the Chamfer distance with the gridding loss, and the loss stopped decreasing after that.

hzxie commented

Please first train GRNet with the Chamfer loss only, and then fine-tune with the gridding loss.
The GPU memory usage may also be more stable that way.
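One way to follow this suggestion in a single run is sketched below; the switch_epoch value, the loss parameters, and the loop names (grnet, train_data_loader, n_epochs) are placeholders for whatever core/train.py actually uses, and in practice you would instead resume fine-tuning from a Chamfer-trained checkpoint.

```python
from extensions.chamfer_dist import ChamferDistance
from extensions.gridding_loss import GriddingLoss

chamfer_dist = ChamferDistance()
gridding_loss = GriddingLoss(scales=[64], alphas=[0.1])  # illustrative values
switch_epoch = 150  # hypothetical: pretrain with Chamfer, then fine-tune with gridding loss

for epoch_idx in range(n_epochs):
    for data in train_data_loader:
        sparse_ptcloud, dense_ptcloud = grnet(data)
        if epoch_idx < switch_epoch:
            dense_loss = chamfer_dist(dense_ptcloud, data['gtcloud'])
        else:
            dense_loss = gridding_loss(dense_ptcloud, data['gtcloud'])
        # ... backward pass, optimizer step and logging as in core/train.py
```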