RuntimeError: CUDA error: an illegal memory access was encountered

Question


Closed this issue 4 years ago · 3 comments

Any idea why this might be happening?

```
Starting initialization
Downloading: "https://download.pytorch.org/models/resnet18-5c106cde.pth" to /root/.cache/torch/checkpoints/resnet18-5c106cde.pth
Loading training dataset metadata:

Can use 57874 images from the kitti (kitti_split) train set for depth training
Can use 2975 images from the cityscapes train set for segmentation training
Loading validation dataset metadata:
Can use 1159 images from the kitti (kitti_split) validation set for depth validation
Can use 500 images from the cityscapes validation set for segmentation validation
Summary:
Model name: sgdepth_chetan
Logging directory: /cmudi001-sgd-1/SGDepth/Checkpoints/sgdepth_chetan_test/sgdepth_chetan
Using device: cuda (GeForce GTX 1080 Ti)
100%|██████████| 46827520/46827520 [00:00<00:00, 57113495.00it/s]
Epoch 0:
kitti_kitti_train_depth losses at epoch 0 (batch 0):
- avg 0.1202
cityscapes_train_seg losses at epoch 0 (batch 0):
- cross_entropy: 2.9558
Breakdown of time spent this epoch:
- unaccounted: 0.000
- loading: 9.133
- optimizer: 0.000
- transfer: 0.075
- forward: 3.802
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1556653114079/work/aten/src/THC/THCCachingHostAllocator.cpp line=265 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "train.py", line 372, in <module>
    trainer.train()
  File "train.py", line 340, in train
    self._run_epoch()
  File "train.py", line 266, in _run_epoch
    loss.backward()
  File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered
```

Answer 1 · 2020-11-29T09:56:14.000Z

I do not usually get such errors, but CUDA errors tend to be cryptic. Did you change anything in the code relative to the standard configuration?

Also, have you tried running the code on a CPU, just to verify that it runs? An error raised on the CPU is sometimes more readable than the same error on the GPU. The code automatically detects whether a GPU is available and falls back to the CPU if not.
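
One way to try this without touching the code is to launch the script with a modified environment. The sketch below is only an illustration: `build_debug_env` and `run_training` are hypothetical helper names, and `train.py` is taken from the traceback above. It relies on two standard CUDA environment variables: `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, so the traceback is more likely to point at the operation that failed, and an empty `CUDA_VISIBLE_DEVICES` hides all GPUs so that the automatic fallback to the CPU kicks in.

```python
import os
import subprocess
import sys


def build_debug_env(force_cpu=False, base_env=None):
    """Return a copy of the environment with CUDA debugging variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Synchronous kernel launches: errors are reported at the call that
    # caused them instead of at some later, unrelated CUDA call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    if force_cpu:
        # An empty value hides every GPU from the CUDA runtime, so code
        # that falls back to the CPU when no GPU is found runs on the CPU.
        env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def run_training(script="train.py", force_cpu=False):
    """Launch the training script (name assumed) with the debug environment."""
    return subprocess.run([sys.executable, script], env=build_debug_env(force_cpu))


# Example usage (not executed here):
#   run_training(force_cpu=True)   # CPU run for a more readable error
#   run_training(force_cpu=False)  # GPU run with synchronous launches
```

Synchronous launches slow training down noticeably, so this is meant for a short debugging run rather than a full training job.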

Answer 2 · 2020-11-29T16:05:51.000Z

I did not change anything in the code except for setting the correct paths to the data.

I tried running on the CPU and it trains without any errors. For now, I will keep this issue open and keep debugging. Thank you.

Answer 3 · 2020-11-29T16:35:06.000Z

After trying some solutions from the PyTorch discussion forums, the error changed to this:

```
Traceback (most recent call last):
  File "train.py", line 372, in <module>
    trainer.train()
  File "train.py", line 340, in train
    self._run_epoch()
  File "train.py", line 249, in _run_epoch
    outputs = model(batch)
  File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/cmudi001-sgd-1/SGDepth/models/sgdepth.py", line 292, in forward
    x = self.seg(*x)
  File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/cmudi001-sgd-1/SGDepth/models/sgdepth.py", line 79, in forward
    x = self.decoder(*x)
  File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/cmudi001-sgd-1/SGDepth/models/networks/partial_decoder.py", line 134, in forward
    x = self.blocks[f'step_{step}'](*x)
  File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/cmudi001-sgd-1/SGDepth/models/networks/partial_decoder.py", line 69, in forward
    x_new = torch.cat((x_new, x_skp), 1)
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/THC/THCCachingHostAllocator.cpp:278
```

UPDATE: It seems the error is happening in the segmentation part. Could you please elaborate on how to prepare the Cityscapes dataset?
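
While waiting for an answer, one thing that might be worth checking is the range of the label values in the prepared segmentation masks. This is only a guess and is not confirmed anywhere in this thread: label IDs outside the range the loss expects are a common cause of CUDA memory errors in segmentation training, and Cityscapes ships both raw label IDs and a smaller set of train IDs. The sketch below uses a hypothetical helper, `find_invalid_labels`, and assumes 19 classes with 255 as the ignore value; the actual values used by this repository may differ.

```python
from collections import Counter


def find_invalid_labels(labels, num_classes, ignore_index=None):
    """Count label values outside [0, num_classes), skipping ignore_index."""
    invalid = Counter()
    for value in labels:
        value = int(value)
        if value == ignore_index:
            continue
        if not 0 <= value < num_classes:
            invalid[value] += 1
    return dict(invalid)


# Hypothetical flattened mask; with PyTorch the values could come from
# something like target.flatten().tolist().
mask = [0, 7, 18, 255, 26, 26, 33]

# num_classes=19 and ignore_index=255 are assumptions, not repository settings.
print(find_invalid_labels(mask, num_classes=19, ignore_index=255))
```

An empty result would suggest the labels are in range and the problem lies elsewhere.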