RuntimeError: CUDA error: an illegal memory access was encountered
Abastro opened this issue · 1 comment
Abastro commented
I keep running into this problem, any ideas? It hits this wall every time: the illegal memory access comes back whenever I run training.
These are the dependencies I have:
numpy 1.20.2
PG-OP 0.0.0
Pillow 8.2.0
pip 21.0.1
plyfile 0.7.4
protobuf 3.16.0
PyYAML 5.4.1
scipy 1.6.3
setuptools 52.0.0.post20210125
six 1.16.0
spconv 1.0
tensorboardX 2.2
torch 1.1.0
torchvision 0.3.0
wheel 0.36.2
With CUDA 10.2 and cuDNN 7.6.5.
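One sanity check I can run (not sure if it is relevant here) is to confirm that the CUDA toolkit PyTorch was built against matches the one spconv was compiled with, since a mismatch between the two is a known cause of illegal memory accesses in sparse-conv setups:

import torch

# Report the versions this environment actually uses; the CUDA version shown
# here should match the toolkit spconv was compiled against.
print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0))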
I was met with the following error:
/home/ubuntu/pointgroup-hs/PointGroup/util/config.py:20: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(f)
[2021-05-08 11:11:43,432 INFO log.py line 40 14354] ************************ Start Logging ************************
[2021-05-08 11:11:43,471 INFO train.py line 26 14354] Namespace(TEST_NMS_THRESH=0.3, TEST_NPOINT_THRESH=100, TEST_SCORE_THRESH=0.09, batch_size=4, bg_thresh=0.25, block_reps=2, block_residual=True, classes=20, cluster_meanActive=50, cluster_npoint_thre=50, cluster_radius=0.03, cluster_shift_meanActive=300, config='config/pointgroup_run1_scannet.yaml', data_root='dataset', dataset='scannetv2', dataset_dir='data/scannetv2_inst.py', epochs=384, eval=True, exp_path='exp/scannetv2/pointgroup/pointgroup_run1_scannet', fg_thresh=0.75, filename_suffix='_inst_nostuff.pth', fix_module=[], full_scale=[128, 512], ignore_label=-100, input_channel=3, loss_weight=[1.0, 1.0, 1.0, 1.0], lr=0.001, m=16, manual_seed=123, max_npoint=250000, mode=4, model_dir='model/pointgroup/pointgroup.py', model_name='pointgroup', momentum=0.9, multiplier=0.5, optim='Adam', prepare_epochs=128, pretrain='', pretrain_module=[], pretrain_path=None, save_freq=16, save_instance=False, save_pt_offsets=False, save_semantic=False, scale=50, score_fullscale=14, score_mode=4, score_scale=50, split='val', step_epoch=384, task='train', test_epoch=384, test_seed=567, test_workers=16, train_workers=16, use_coords=True, weight_decay=0.0001)
[2021-05-08 11:11:43,478 INFO train.py line 135 14354] => creating model ...
[2021-05-08 11:11:43,610 INFO train.py line 147 14354] cuda available: True
[2021-05-08 11:11:46,651 INFO train.py line 152 14354] #classifier parameters: 7715016
[2021-05-08 11:12:34,348 INFO scannetv2_inst.py line 43 14354] Training samples: 1201
[2021-05-08 11:12:46,605 INFO scannetv2_inst.py line 54 14354] Validation samples: 312
[2021-05-08 11:12:46,665 INFO utils.py line 61 14354] Restore from exp/scannetv2/pointgroup/pointgroup_run1_scannet/pointgroup_run1_scannet-000000001.pth
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=265 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
File "train.py", line 179, in <module>
train_epoch(dataset.train_data_loader, model, model_fn, optimizer, epoch)
File "train.py", line 54, in train_epoch
loss, _, visual_dict, meter_dict = model_fn(batch, model, epoch)
File "/home/ubuntu/pointgroup-hs/PointGroup/model/pointgroup/pointgroup.py", line 398, in model_fn
ret = model(input_, p2v_map, coords_float, coords[:, 0].int(), batch_offsets, epoch)
File "/home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/pointgroup-hs/PointGroup/model/pointgroup/pointgroup.py", line 264, in forward
output = self.input_conv(input)
File "/home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/spconv/modules.py", line 123, in forward
input = module(input)
File "/home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/spconv/conv.py", line 157, in forward
outids.shape[0])
File "/home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/spconv/functional.py", line 83, in forward
return ops.indice_conv(features, filters, indice_pairs, indice_pair_num, num_activate_out, False, True)
File "/home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/spconv/ops.py", line 112, in indice_conv
int(inverse), int(subm))
RuntimeError: CUDA error: an illegal memory access was encountered (copy_to_cpu at /pytorch/aten/src/ATen/native/cuda/Copy.cu:199)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7ff926443441 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7ff926442d7a in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: (anonymous namespace)::copy_to_cpu(at::Tensor&, at::Tensor const&) + 0xa45 (0x7ff8c41b2a65 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: void (anonymous namespace)::_copy__cuda<int>(at::Tensor&, at::Tensor const&, bool) + 0x5ae (0x7ff8c425335e in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so)
frame #4: at::native::_s_copy__cuda(at::Tensor&, at::Tensor const&, bool) + 0x378 (0x7ff8c41b45d8 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so)
frame #5: at::native::_s_copy_from_cuda(at::Tensor const&, at::Tensor const&, bool) + 0x32 (0x7ff8c41b4c62 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so)
frame #6: at::CUDAType::_s_copy_from(at::Tensor const&, at::Tensor const&, bool) const + 0xdd (0x7ff8c30bc78d in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so)
frame #7: at::native::_s_copy__cpu(at::Tensor&, at::Tensor const&, bool) + 0x5f (0x7ff8b8003e6f in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2.so)
frame #8: <unknown function> + 0xb8cb9f (0x7ff8b82c5b9f in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2.so)
frame #9: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x26d (0x7ff8b800333d in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2.so)
frame #10: torch::autograd::VariableType::copy_(at::Tensor&, at::Tensor const&, bool) const + 0x629 (0x7ff92532cdc9 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
frame #11: at::native::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) + 0x86c (0x7ff8b81459cc in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2.so)
frame #12: at::TypeDefault::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) const + 0x17 (0x7ff8b83c4857 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libcaffe2.so)
frame #13: torch::autograd::VariableType::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) const + 0x2c2 (0x7ff925102b52 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
frame #14: at::Tensor spconv::indiceConv<float>(at::Tensor, at::Tensor, at::Tensor, at::Tensor, long, long, long) + 0x1be (0x7ff912386efe in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/spconv/libspconv.so)
frame #15: void torch::jit::detail::callOperatorWithTuple<at::Tensor (* const)(at::Tensor, at::Tensor, at::Tensor, at::Tensor, long, long, long), at::Tensor, at::Tensor, at::Tensor, at::Tensor, long, long, long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul>(c10::FunctionSchema const&, at::Tensor (* const&&)(at::Tensor, at::Tensor, at::Tensor, at::Tensor, long, long, long), std::vector<c10::IValue, std::allocator<c10::IValue> >&, std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, long, long, long>&, torch::Indices<0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul>) + 0x267 (0x7ff91238e157 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/spconv/libspconv.so)
frame #16: std::_Function_handler<int (std::vector<c10::IValue, std::allocator<c10::IValue> >&), torch::jit::createOperator<at::Tensor (*)(at::Tensor, at::Tensor, at::Tensor, at::Tensor, long, long, long)>(std::string const&, at::Tensor (*&&)(at::Tensor, at::Tensor, at::Tensor, at::Tensor, long, long, long))::{lambda(std::vector<c10::IValue, std::allocator<c10::IValue> >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocator<c10::IValue> >&) + 0x61 (0x7ff91238e3c1 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/spconv/libspconv.so)
frame #17: <unknown function> + 0x3d93a5 (0x7ff926a353a5 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #18: <unknown function> + 0x130fac (0x7ff92678cfac in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #26: THPFunction_apply(_object*, _object*) + 0x6b1 (0x7ff926a10301 in /home/ubuntu/miniconda3/envs/pointgroup/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
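(Side note: the YAMLLoadWarning at the very top should be unrelated to the crash; it is only about the deprecated default loader. A minimal fix in util/config.py, assuming the config file is plain YAML with no custom tags:)

import yaml

with open("config/pointgroup_run1_scannet.yaml") as f:
    # safe_load is equivalent to yaml.load(f, Loader=yaml.SafeLoader)
    config = yaml.safe_load(f)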
So, can anyone tell me which part is wrong?
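In case it helps narrow this down: CUDA reports illegal memory accesses asynchronously, so the frame in the traceback above may not be where the fault actually happened. Forcing synchronous kernel launches (debugging only, it slows everything down) usually moves the error closer to the failing kernel. A minimal sketch, placed at the very top of train.py:

import os

# Debugging only: force synchronous CUDA kernel launches so the traceback
# points at the kernel that actually faulted. Must be set before the first
# CUDA call, hence before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the environment variable is set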
Abastro commented
(It turned out to be a local RAM issue.)