error: an illegal memory access was encountered
Hi, when I run this code with my own dataset, it fails with the following output:
Use config:
{'CONST': {'DEVICE': '0', 'NUM_WORKERS': 0, 'N_INPUT_POINTS': 2048},
'DATASET': {'TEST_DATASET': 'Completion3D', 'TRAIN_DATASET': 'Completion3D'},
'DATASETS': {'COMPLETION3D': {'CATEGORY_FILE_PATH': './datasets/Completion3D.json',
'COMPLETE_POINTS_PATH': '/media/yaogan_504/5E4006134005F297/linhemin/GRNet-master/datasets/Completion3D/%s/gt/%s/%s.h5',
'PARTIAL_POINTS_PATH': '/media/yaogan_504/5E4006134005F297/linhemin/GRNet-master/datasets/Completion3D/%s/partial/%s/%s.h5'},
'KITTI': {'BOUNDING_BOX_FILE_PATH': '/home/SENSETIME/xiehaozhe/Datasets/KITTI/bboxes/%s.txt',
'CATEGORY_FILE_PATH': './datasets/KITTI.json',
'PARTIAL_POINTS_PATH': '/home/SENSETIME/xiehaozhe/Datasets/KITTI/cars/%s.pcd'},
'SHAPENET': {'CATEGORY_FILE_PATH': './datasets/ShapeNet.json',
'COMPLETE_POINTS_PATH': '/home/SENSETIME/xiehaozhe/Datasets/ShapeNet/ShapeNetCompletion/%s/complete/%s/%s.pcd',
'N_POINTS': 16384,
'N_RENDERINGS': 8,
'PARTIAL_POINTS_PATH': '/home/SENSETIME/xiehaozhe/Datasets/ShapeNet/ShapeNetCompletion/%s/partial/%s/%s/%02d.pcd'}},
'DIR': {'OUT_PATH': './output'},
'MEMCACHED': {'CLIENT_CONFIG': '/mnt/lustre/share/memcached_client/client.conf',
'ENABLED': False,
'LIBRARY_PATH': '/mnt/lustre/share/pymc/py3',
'SERVER_CONFIG': '/mnt/lustre/share/memcached_client/server_list.conf'},
'NETWORK': {'GRIDDING_LOSS_ALPHAS': [0.1],
'GRIDDING_LOSS_SCALES': [128],
'N_SAMPLING_POINTS': 2048},
'TEST': {'METRIC_NAME': 'ChamferDistance'},
'TRAIN': {'BATCH_SIZE': 1,
'BETAS': [0.9, 0.999],
'GAMMA': 0.5,
'LEARNING_RATE': 0.0001,
'LR_MILESTONES': [50],
'N_EPOCHS': 150,
'SAVE_FREQ': 25,
'WEIGHT_DECAY': 0}}
[INFO] 2021-01-04 11:01:13,525 Collecting files of Taxonomy [ID=all, Name=Uncategorized Test Set]
[INFO] 2021-01-04 11:01:13,529 Collecting files of Taxonomy [ID=02691156, Name=classic]
[INFO] 2021-01-04 11:01:13,532 Collecting files of Taxonomy [ID=02933112, Name=other]
[INFO] 2021-01-04 11:01:13,533 Complete collecting files of the dataset. Total files: 104
[INFO] 2021-01-04 11:01:13,534 Collecting files of Taxonomy [ID=all, Name=Uncategorized Test Set]
[INFO] 2021-01-04 11:01:13,535 Collecting files of Taxonomy [ID=02691156, Name=classic]
[INFO] 2021-01-04 11:01:13,537 Collecting files of Taxonomy [ID=02933112, Name=other]
[INFO] 2021-01-04 11:01:13,538 Complete collecting files of the dataset. Total files: 14
[DEBUG] 2021-01-04 11:01:14,724 Parameters in GRNet: 76707626.
[INFO] 2021-01-04 11:01:19,336 [Epoch 1/150][Batch 1/104] BatchTime = 0.697 (s) DataTime = 0.049 (s) Losses = ['535.0128', '533.6913']
[INFO] 2021-01-04 11:01:19,450 [Epoch 1/150][Batch 2/104] BatchTime = 0.114 (s) DataTime = 0.020 (s) Losses = ['581.4405', '579.0204']
[INFO] 2021-01-04 11:01:19,598 [Epoch 1/150][Batch 3/104] BatchTime = 0.147 (s) DataTime = 0.055 (s) Losses = ['758.7496', '758.8049']
[INFO] 2021-01-04 11:01:19,695 [Epoch 1/150][Batch 4/104] BatchTime = 0.098 (s) DataTime = 0.006 (s) Losses = ['695.8061', '692.7615']
[INFO] 2021-01-04 11:01:19,793 [Epoch 1/150][Batch 5/104] BatchTime = 0.097 (s) DataTime = 0.006 (s) Losses = ['554.6122', '544.2510']
[INFO] 2021-01-04 11:01:19,931 [Epoch 1/150][Batch 6/104] BatchTime = 0.138 (s) DataTime = 0.044 (s) Losses = ['539.5575', '530.9702']
[INFO] 2021-01-04 11:01:20,071 [Epoch 1/150][Batch 7/104] BatchTime = 0.141 (s) DataTime = 0.048 (s) Losses = ['556.2327', '553.9484']
[INFO] 2021-01-04 11:01:20,171 [Epoch 1/150][Batch 8/104] BatchTime = 0.099 (s) DataTime = 0.006 (s) Losses = ['682.0801', '675.5230']
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=77 : an illegal memory access was encountered
Have you encountered this issue before? Thanks in advance.
The full error report is:
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
File "/home/yaogan_504/pycharm-community-2020.2.2/plugins/python-ce/helpers/pydev/pydevd.py", line 1448, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/home/yaogan_504/pycharm-community-2020.2.2/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/media/yaogan_504/5E4006134005F297/linhemin/GRNet-master/runner.py", line 76, in <module>
main()
File "/media/yaogan_504/5E4006134005F297/linhemin/GRNet-master/runner.py", line 58, in main
train_net(cfg)
File "/media/yaogan_504/5E4006134005F297/linhemin/GRNet-master/core/train.py", line 112, in train_net
sparse_ptcloud, dense_ptcloud = grnet(data)
File "/home/yaogan_504/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/yaogan_504/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/yaogan_504/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/media/yaogan_504/5E4006134005F297/linhemin/GRNet-master/models/grnet.py", line 138, in forward
sparse_cloud = self.point_sampling(sparse_cloud, partial_cloud)
File "/home/yaogan_504/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/media/yaogan_504/5E4006134005F297/linhemin/GRNet-master/models/grnet.py", line 21, in forward
pred_cloud = torch.cat([partial_cloud, pred_cloud], dim=1)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
I've never encountered this issue before.
No other users have reported similar issues.
Does it also happen with the ShapeNet dataset?
Hi @hzxie,
I'm hitting the same error. RuntimeError: CUDA error: an illegal memory access was encountered
Here is some output that may be helpful. Could something be wrong in the feature sampling?
A NaN check on the "point features" after feature sampling returns True:
tensor(True, device='cuda:0') =================point features after feature sample=================
Looking forward to your reply.
Best
@yjcaimeow
Maybe the coordinates of the points in sparse_cloud are out of the range (-1, 1).
I checked the range of sparse_cloud and it is within [-1, 1]. The sparse cloud is obtained from the gridding_rev operation.
I also tried multiplying it by 0.5, but the same error still occurs. :(
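For anyone repeating the range check discussed above, here is a minimal sketch of how it could be done on the raw data before it is moved to the GPU. The helper name `summarize_cloud` is hypothetical and not part of GRNet, and treating the bounds as inclusive is an assumption, since the thread mentions both (-1, 1) and [-1, 1].

```python
import numpy as np


def summarize_cloud(points, low=-1.0, high=1.0):
    """Report NaN/Inf presence and coordinate bounds for an (N, 3) point array.

    Hypothetical helper for debugging; `low` and `high` are assumed bounds.
    """
    pts = np.asarray(points, dtype=np.float64)
    finite = np.isfinite(pts)
    all_finite = bool(finite.all())
    if finite.any():
        # Compute bounds over the finite entries only, so NaN does not hide them.
        valid = pts[finite]
        lo, hi = float(valid.min()), float(valid.max())
    else:
        lo = hi = float("nan")
    return {
        "all_finite": all_finite,
        "min": lo,
        "max": hi,
        "in_range": all_finite and lo >= low and hi <= high,
    }


if __name__ == "__main__":
    cloud = np.random.uniform(-0.5, 0.5, size=(2048, 3))
    print(summarize_cloud(cloud))
```

Running this on both the partial and the complete clouds of every sample, not just one tensor inside the network, may help locate a single bad file in a custom dataset.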
Hi @lin1061991611, maybe your extensions were compiled against a mismatched version of CUDA or cuDNN? I once ran into this issue with a TensorFlow model.
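One way to test the version-mismatch hypothesis is to compare the CUDA version PyTorch was built with against the `nvcc` found on the PATH, which is typically the compiler used when building the extensions. The sketch below is only a diagnostic aid; the function name `cuda_build_info` is hypothetical, and it assumes `nvcc` on the PATH is the one that compiled the extensions.

```python
import shutil
import subprocess


def cuda_build_info():
    """Collect the PyTorch build's CUDA version and the local nvcc release line."""
    info = {"torch": None, "torch_cuda": None, "nvcc": None}
    try:
        import torch

        info["torch"] = torch.__version__
        info["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        pass  # PyTorch is not installed in this environment

    nvcc = shutil.which("nvcc")
    if nvcc:
        result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
        # Keep only the line that names the toolkit release, if present.
        info["nvcc"] = next(
            (line.strip() for line in result.stdout.splitlines() if "release" in line),
            None,
        )
    return info


if __name__ == "__main__":
    print(cuda_build_info())
```

If the two versions differ, rebuilding the extensions in the same environment that runs training is a reasonable next step.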
Has anyone solved this issue? I'm using pytorch=1.4.0, cuda=10.1 with a fresh conda environment. Still getting this error.
I think this is caused by the "gridding" extension. Its output tensors cannot be read: the error appears even when they are just printed or assigned to a specific GPU device.
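The symptom described here, where a tensor fails as soon as it is printed, is consistent with how CUDA reports errors: kernel launches are asynchronous, so an illegal access inside one kernel often surfaces at a later, unrelated call such as `torch.cat`. A minimal sketch for getting a traceback that points at the failing launch, assuming the variable is set before any CUDA work starts:

```python
import os

# Force synchronous kernel launches so the Python traceback points at the
# operation that actually failed. This must be set before the first CUDA call;
# setting it before importing torch is the safest option.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch            # import only after the variable is set
# ... run training here   # expect it to be noticeably slower in this mode
```

The same effect can be had from the shell by prefixing the launch command with `CUDA_LAUNCH_BLOCKING=1`. This only helps locate the failing operation; it does not fix it.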
Has anyone solved this issue? I'm using pytorch=1.8.1+cu111 with a fresh conda environment. Still getting this error.
I found the cause of my problem: the training data was not normalized.
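The comment does not say which normalization was applied, so the following is only one plausible sketch: center the cloud on its bounding-box midpoint and scale it so that every coordinate falls within `[-limit, limit]`. The function names and the default `limit=0.5` are assumptions, not values taken from GRNet.

```python
import numpy as np


def normalize_cloud(points, limit=0.5):
    """Center an (N, 3) cloud on its bounding box and scale it into [-limit, limit].

    Returns the normalized points plus the center and scale, so the same
    transform can be reused for a paired cloud.
    """
    pts = np.asarray(points, dtype=np.float32)
    center = (pts.max(axis=0) + pts.min(axis=0)) / 2.0
    centered = pts - center
    half_extent = np.abs(centered).max()
    if half_extent == 0:
        # Degenerate cloud (all points identical): nothing to scale.
        return centered, center, 1.0
    scale = limit / half_extent
    return centered * scale, center, scale


def apply_normalization(points, center, scale):
    """Apply a previously computed center and scale to another cloud."""
    return (np.asarray(points, dtype=np.float32) - center) * scale


if __name__ == "__main__":
    complete = np.random.uniform(100.0, 140.0, size=(2048, 3))
    partial = complete[:512]
    complete_n, center, scale = normalize_cloud(complete)
    partial_n = apply_normalization(partial, center, scale)
    print(np.abs(complete_n).max(), np.abs(partial_n).max())
```

Computing the transform once per sample and applying it to both the partial and the complete cloud keeps the pair aligned, which matters for a completion loss.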