hzxie/GRNet

Error in gridding_cuda_forward: an illegal memory access was encountered

Closed · 2 comments

When I train the model on my own dataset with my own data loader, the following error occurs after some number of batches. I tried different PyTorch and CUDA versions, but the error persists:

[DEBUG] 2024-05-22 00:44:33,174 Parameters in GRNet: 76713770.
/home/ubuntu/SRT/code/GRNet/models/furthestPointSampling/fps.py:23: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1712608851799/work/torch/csrc/tensor/python_tensor.cpp:78.)
  output = torch.cuda.IntTensor(B, npoint)
[INFO] 2024-05-22 00:44:35,531 [Epoch 1/150][Batch 1/111] BatchTime = 2.195 (s) DataTime = 1.787 (s) Losses = ['676.9331', '675.5205']
[INFO] 2024-05-22 00:44:36,685 [Epoch 1/150][Batch 2/111] BatchTime = 1.155 (s) DataTime = 1.047 (s) Losses = ['680.1070', '677.2239']
[INFO] 2024-05-22 00:44:38,940 [Epoch 1/150][Batch 3/111] BatchTime = 2.254 (s) DataTime = 2.094 (s) Losses = ['668.1997', '663.4119']
[INFO] 2024-05-22 00:44:40,143 [Epoch 1/150][Batch 4/111] BatchTime = 1.203 (s) DataTime = 1.101 (s) Losses = ['646.1142', '644.6344']
[INFO] 2024-05-22 00:44:41,230 [Epoch 1/150][Batch 5/111] BatchTime = 1.087 (s) DataTime = 0.977 (s) Losses = ['673.4921', '671.2493']
Error in gridding_cuda_forward: an illegal memory access was encountered
Traceback (most recent call last):
  File "runner.py", line 82, in <module>
    main()
  File "runner.py", line 64, in main
    train_net(cfg)
  File "/home/ubuntu/SRT/code/GRNet/core/train.py", line 155, in train_net
    sparse_ptcloud, dense_ptcloud = grnet(data)
  File "/home/ubuntu/anaconda3/envs/grnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/grnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/SRT/code/GRNet/models/grnet.py", line 129, in forward
    pt_features_64_c1 = self.gridding(c1).view(-1, 1, 64, 64, 64)
  File "/home/ubuntu/anaconda3/envs/grnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/grnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/SRT/code/GRNet/extensions/gridding/__init__.py", line 48, in forward
    return torch.cat(grids, dim=0).contiguous()
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Does the 'gridding' extension have a CUDA memory problem, as in #19 (comment)? Its output tensors cannot be read at all, not even to print them or to refer them to a specific GPU device.

@hzxie, any suggestions would be helpful.

hzxie commented

Refer to #19
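
Independently of what #19 describes, an illegal memory access in a custom CUDA kernel is often caused by inputs that make the kernel index outside its buffers, so it may help to validate each sample from a custom data loader before it reaches the gridding layer. The sketch below is not part of GRNet: `check_point_cloud` is a hypothetical helper, and the assumption that coordinates should be finite and lie within [-1, 1] is not confirmed by this thread, so adjust `bound` to whatever normalization your setup uses.

```python
import numpy as np


def check_point_cloud(points, bound=1.0):
    """Return a list of problems found in a point cloud of shape (N, 3).

    Hypothetical helper; `bound` is an assumed coordinate limit.
    """
    points = np.asarray(points, dtype=np.float32)
    problems = []

    # The cloud must be a non-empty (N, 3) array.
    if points.ndim != 2 or points.shape[1] != 3:
        problems.append(f"unexpected shape {points.shape}")
        return problems
    if points.shape[0] == 0:
        problems.append("empty point cloud")
        return problems

    # NaN or inf coordinates cannot be mapped to a valid grid cell.
    finite = np.isfinite(points)
    if not finite.all():
        problems.append(f"{int((~finite).sum())} non-finite values")

    # Count points with at least one finite coordinate beyond the bound.
    out_of_range = finite & (np.abs(points) > bound)
    if out_of_range.any():
        n_bad = int(out_of_range.any(axis=1).sum())
        problems.append(f"{n_bad} points outside [-{bound}, {bound}]")

    return problems


if __name__ == "__main__":
    sample = [[0.0, 0.5, -1.0], [2.0, 0.0, 0.0]]
    for problem in check_point_cloud(sample):
        print("bad sample:", problem)
```

Calling such a check on every sample the data loader returns makes it possible to identify the offending file before the crash. Running the training script with the environment variable `CUDA_LAUNCH_BLOCKING=1` also makes kernel launches synchronous, so the Python traceback points at the call that actually failed rather than at a later operation.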