dvlab-research/PointGroup

RuntimeError: CUDA error: an illegal memory access was encountered

Abastro opened this issue · 1 comment

I keep running into this problem; any ideas? It hits the same wall every time.
The illegal memory access happens again whenever I run `train.py`.
These are the dependencies I have:

numpy        1.20.2
PG-OP        0.0.0
Pillow       8.2.0
pip          21.0.1
plyfile      0.7.4
protobuf     3.16.0
PyYAML       5.4.1
scipy        1.6.3
setuptools   52.0.0.post20210125
six          1.16.0
spconv       1.0
tensorboardX 2.2
torch        1.1.0
torchvision  0.3.0
wheel        0.36.2

With CUDA 10.2 and cudnn 7.6.5.
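For reference, here is a quick sanity check (just a sketch, nothing PointGroup-specific) of whether the installed torch build actually matches the system CUDA / cuDNN versions above:

```python
# Minimal environment sanity check; assumes only torch is installed.
import torch

print("torch:", torch.__version__)                 # expect 1.1.0
print("built against CUDA:", torch.version.cuda)   # compare with the system toolkit (10.2 here)
print("cuDNN:", torch.backends.cudnn.version())    # cuDNN 7.6.5 prints as 7605
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```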
I was met with the following error:

/home/ubuntu/pointgroup-hs/PointGroup/util/config.py:20: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
[2021-05-08 11:11:43,432  INFO  log.py  line 40  14354]  ************************ Start Logging ************************
[2021-05-08 11:11:43,471  INFO  train.py  line 26  14354]  Namespace(TEST_NMS_THRESH=0.3, TEST_NPOINT_THRESH=100, TEST_SCORE_THRESH=0.09, batch_size=4, bg_thresh=0.25, block_reps=2, block_residual=True, classes=20, cluster_meanActive=50, cluster_npoint_thre=50, cluster_radius=0.03, cluster_shift_meanActive=300, config='config/pointgroup_run1_scannet.yaml', data_root='dataset', dataset='scannetv2', dataset_dir='data/scannetv2_inst.py', epochs=384, eval=True, exp_path='exp/scannetv2/pointgroup/pointgroup_run1_scannet', fg_thresh=0.75, filename_suffix='_inst_nostuff.pth', fix_module=[], full_scale=[128, 512], ignore_label=-100, input_channel=3, loss_weight=[1.0, 1.0, 1.0, 1.0], lr=0.001, m=16, manual_seed=123, max_npoint=250000, mode=4, model_dir='model/pointgroup/pointgroup.py', model_name='pointgroup', momentum=0.9, multiplier=0.5, optim='Adam', prepare_epochs=128, pretrain='', pretrain_module=[], pretrain_path=None, save_freq=16, save_instance=False, save_pt_offsets=False, save_semantic=False, scale=50, score_fullscale=14, score_mode=4, score_scale=50, split='val', step_epoch=384, task='train', test_epoch=384, test_seed=567, test_workers=16, train_workers=16, use_coords=True, weight_decay=0.0001)
[2021-05-08 11:11:43,478  INFO  train.py  line 135  14354]  => creating model ...
[2021-05-08 11:11:43,610  INFO  train.py  line 147  14354]  cuda available: True
[2021-05-08 11:11:46,651  INFO  train.py  line 152  14354]  #classifier parameters: 7715016
[2021-05-08 11:12:34,348  INFO  scannetv2_inst.py  line 43  14354]  Training samples: 1201
[2021-05-08 11:12:46,605  INFO  scannetv2_inst.py  line 54  14354]  Validation samples: 312
[2021-05-08 11:12:46,665  INFO  utils.py  line 61  14354]  Restore from exp/scannetv2/pointgroup/pointgroup_run1_scannet/pointgroup_run1_scannet-000000001.pth
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=265 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "train.py", line 179, in <module>
    train_epoch(dataset.train_data_loader, model, model_fn, optimizer, epoch)
  File "train.py", line 54, in train_epoch
    loss, _, visual_dict, meter_dict = model_fn(batch, model, epoch)
  File "/home/ubuntu/pointgroup-hs/PointGroup/model/pointgroup/pointgroup.py", line 398, in model_fn
    ret = model(input_, p2v_map, coords_float, coords[:, 0].int(), batch_offsets, epoch)
  File "/home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/pointgroup-hs/PointGroup/model/pointgroup/pointgroup.py", line 264, in forward
    output = self.input_conv(input)
  File "/home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/spconv/modules.py", line 123, in forward
    input = module(input)
  File "/home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/spconv/conv.py", line 157, in forward
    outids.shape[0])
  File "/home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/spconv/functional.py", line 83, in forward
    return ops.indice_conv(features, filters, indice_pairs, indice_pair_num, num_activate_out, False, True)
  File "/home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/spconv/ops.py", line 112, in indice_conv
    int(inverse), int(subm))
RuntimeError: CUDA error: an illegal memory access was encountered (copy_to_cpu at /pytorch/aten/src/ATen/native/cuda/Copy.cu:199)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7ff926443441 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7ff926442d7a in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: (anonymous namespace)::copy_to_cpu(at::Tensor&, at::Tensor const&) + 0xa45 (0x7ff8c41b2a65 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: void (anonymous namespace)::_copy__cuda<int>(at::Tensor&, at::Tensor const&, bool) + 0x5ae (0x7ff8c425335e in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so)
frame #4: at::native::_s_copy__cuda(at::Tensor&, at::Tensor const&, bool) + 0x378 (0x7ff8c41b45d8 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so)
frame #5: at::native::_s_copy_from_cuda(at::Tensor const&, at::Tensor const&, bool) + 0x32 (0x7ff8c41b4c62 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so)
frame #6: at::CUDAType::_s_copy_from(at::Tensor const&, at::Tensor const&, bool) const + 0xdd (0x7ff8c30bc78d in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so)
frame #7: at::native::_s_copy__cpu(at::Tensor&, at::Tensor const&, bool) + 0x5f (0x7ff8b8003e6f in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2.so)
frame #8: <unknown function> + 0xb8cb9f (0x7ff8b82c5b9f in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2.so)
frame #9: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x26d (0x7ff8b800333d in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2.so)
frame #10: torch::autograd::VariableType::copy_(at::Tensor&, at::Tensor const&, bool) const + 0x629 (0x7ff92532cdc9 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
frame #11: at::native::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) + 0x86c (0x7ff8b81459cc in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2.so)
frame #12: at::TypeDefault::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) const + 0x17 (0x7ff8b83c4857 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2.so)
frame #13: torch::autograd::VariableType::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) const + 0x2c2 (0x7ff925102b52 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
frame #14: at::Tensor spconv::indiceConv<float>(at::Tensor, at::Tensor, at::Tensor, at::Tensor, long, long, long) + 0x1be (0x7ff912386efe in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/spconv/libspconv.so)
frame #15: void torch::jit::detail::callOperatorWithTuple<at::Tensor (* const)(at::Tensor, at::Tensor, at::Tensor, at::Tensor, long, long, long), at::Tensor, at::Tensor, at::Tensor, at::Tensor, long, long, long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul>(c10::FunctionSchema const&, at::Tensor (* const&&)(at::Tensor, at::Tensor, at::Tensor, at::Tensor, long, long, long), std::vector<c10::IValue, std::allocator<c10::IValue> >&, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, long, long, long>&, torch::Indices<0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul>) + 0x267 (0x7ff91238e157 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/spconv/libspconv.so)
frame #16: std::_Function_handler<int (std::vector<c10::IValue, std::allocator<c10::IValue> >&), torch::jit::createOperator<at::Tensor (*)(at::Tensor, at::Tensor, at::Tensor, at::Tensor, long, long, long)>(std::string const&, at::Tensor (*&&)(at::Tensor, at::Tensor, at::Tensor, at::Tensor, long, long, long))::{lambda(std::vector<c10::IValue, std::allocator<c10::IValue> >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocator<c10::IValue> >&) + 0x61 (0x7ff91238e3c1 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/spconv/libspconv.so)
frame #17: <unknown function> + 0x3d93a5 (0x7ff926a353a5 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #18: <unknown function> + 0x130fac (0x7ff92678cfac in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #26: THPFunction_apply(_object*, _object*) + 0x6b1 (0x7ff926a10301 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
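
One general debugging note (a hedged suggestion, not specific to this repo): CUDA kernel launches are asynchronous, so an illegal memory access is often reported at a later, unrelated call such as the `copy_to_cpu` above. Forcing synchronous launches usually makes the traceback point at the call that actually faulted:

```python
# Force synchronous CUDA kernel launches so the Python traceback points at
# the call that actually triggered the illegal access. This must be set
# before the first CUDA context is created, e.g. at the very top of
# train.py, or on the command line:
#   CUDA_LAUNCH_BLOCKING=1 python train.py --config config/pointgroup_run1_scannet.yaml
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```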

So, can anyone tell me which part is wrong?

(Resolved: it turned out to be a local RAM issue.)
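
For anyone hitting the same thing: the earliest failure in the log is in `THCCachingHostAllocator.cpp`, i.e. pinned host memory, which at least fits the eventual diagnosis of running out of system RAM rather than GPU memory. A rough way to watch host memory while the dataloaders spin up is sketched below (assumes `psutil` is installed; the helper name is just for illustration). Lowering `train_workers` or `batch_size` in the YAML config are the obvious knobs if RAM is the bottleneck.

```python
# Rough host-RAM monitor to run alongside training; assumes psutil is
# installed (pip install psutil). Prints available system memory once a
# second so an out-of-RAM condition during data loading is easy to spot.
import time
import psutil

def watch_ram(interval_s=1.0):
    while True:
        mem = psutil.virtual_memory()
        print(f"available RAM: {mem.available / 2**30:.2f} GiB "
              f"({mem.percent:.0f}% used)")
        time.sleep(interval_s)

if __name__ == "__main__":
    watch_ram()
```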