RuntimeError: CUDA error: an illegal memory access was encountered
Any idea why this might be happening?
`Starting initialization
Downloading: "https://download.pytorch.org/models/resnet18-5c106cde.pth" to /root/.cache/torch/checkpoints/resnet18-5c106cde.pth
Loading training dataset metadata:
- Can use 57874 images from the kitti (kitti_split) train set for depth training
- Can use 2975 images from the cityscapes train set for segmentation training
Loading validation dataset metadata:
- Can use 1159 images from the kitti (kitti_split) validation set for depth validation
- Can use 500 images from the cityscapes validation set for segmentation validation
Summary:
- Model name: sgdepth_chetan
- Logging directory: /cmudi001-sgd-1/SGDepth/Checkpoints/sgdepth_chetan_test/sgdepth_chetan
- Using device: cuda (GeForce GTX 1080 Ti)
100%|██████████| 46827520/46827520 [00:00<00:00, 57113495.00it/s]
Epoch 0:
- kitti_kitti_train_depth losses at epoch 0 (batch 0):
- avg 0.1202
- cityscapes_train_seg losses at epoch 0 (batch 0):
- cross_entropy: 2.9558
- Breakdown of time spent this epoch:
- unaccounted: 0.000
- loading: 9.133
- optimizer: 0.000
- transfer: 0.075
- forward: 3.802
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1556653114079/work/aten/src/THC/THCCachingHostAllocator.cpp line=265 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
File "train.py", line 372, in <module>
trainer.train()
File "train.py", line 340, in train
self._run_epoch()
File "train.py", line 266, in _run_epoch
loss.backward()
File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered`
I do not usually get such errors, but CUDA errors are rather cryptic in most cases. Did you change anything in the code relative to the standard configuration?
Also, have you already tried running the code on the CPU, just to verify that it runs? An error raised on the CPU is often more readable than its GPU counterpart. The code automatically detects whether a GPU is available and falls back to the CPU if not.
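For anyone who wants to try the CPU check without touching the code, here is a minimal sketch that builds the environment for a debugging run. It assumes the training script picks its device through PyTorch's usual CUDA availability check, so hiding the GPUs with an empty `CUDA_VISIBLE_DEVICES` should make it fall back to the CPU. The helper name `debug_env` is made up for this example. The second mode sets `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous so that a GPU traceback is more likely to point at the operation that actually failed.

```python
import os


def debug_env(force_cpu=True, base=None):
    """Return a copy of the environment set up for a debugging run.

    Hypothetical helper, not part of SGDepth.
    """
    env = dict(os.environ if base is None else base)
    if force_cpu:
        # An empty value hides all GPUs from the CUDA runtime, so a script
        # that auto-detects its device should end up on the CPU.
        env["CUDA_VISIBLE_DEVICES"] = ""
    else:
        # Synchronous kernel launches: slower, but errors are reported at
        # the call that caused them instead of at a later, unrelated call.
        env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Usage (not executed here; add whatever arguments you normally pass):
#   import subprocess, sys
#   subprocess.run([sys.executable, "train.py"], env=debug_env(force_cpu=True))
```

If the CPU run is clean, repeating the run with `debug_env(force_cpu=False)` is a cheap next step before changing any code.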
I did not change anything in the code except for setting the correct paths to the data.
I tried running on the CPU and it trains without any errors. For now, I will keep this issue open and keep debugging. Thank you.
After following some suggestions from the PyTorch discussion forum, the error has changed to this:
Traceback (most recent call last):
File "train.py", line 372, in <module>
trainer.train()
File "train.py", line 340, in train
self._run_epoch()
File "train.py", line 249, in _run_epoch
outputs = model(batch)
File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/cmudi001-sgd-1/SGDepth/models/sgdepth.py", line 292, in forward
x = self.seg(*x)
File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/cmudi001-sgd-1/SGDepth/models/sgdepth.py", line 79, in forward
x = self.decoder(*x)
File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/cmudi001-sgd-1/SGDepth/models/networks/partial_decoder.py", line 134, in forward
x = self.blocks[f'step_{step}'](*x)
File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/cmudi001-sgd-1/SGDepth/models/networks/partial_decoder.py", line 69, in forward
x_new = torch.cat((x_new, x_skp), 1)
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/THC/THCCachingHostAllocator.cpp:278
UPDATE: The error seems to occur in the segmentation part. Could you please elaborate on how to prepare the Cityscapes dataset?
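One thing that may be worth checking while preparing the data: a frequent cause of illegal memory accesses in a segmentation loss is a label ID that lies outside the range the loss expects. Whether that is what happens here is not confirmed anywhere in this thread. The sketch below scans a flat sequence of label IDs and reports the offending values. The defaults are assumptions: 19 classes and an ignore value of 255 follow the usual Cityscapes train-ID convention and may not match what SGDepth expects, and `find_invalid_labels` is a made-up helper name.

```python
def find_invalid_labels(labels, num_classes=19, ignore_index=255):
    """Return the sorted distinct label IDs a cross-entropy loss would reject.

    A label is valid if it lies in [0, num_classes) or equals ignore_index.
    Hypothetical helper; both defaults are assumptions, not SGDepth settings.
    """
    invalid = {
        int(value)
        for value in labels
        if value != ignore_index and not 0 <= value < num_classes
    }
    return sorted(invalid)


# Raw Cityscapes label IDs (rather than train IDs) would show up like this:
print(find_invalid_labels([0, 18, 255, 19, 33, -1]))  # [-1, 19, 33]

# With a PyTorch target tensor this could be called as (not executed here):
#   find_invalid_labels(target.flatten().tolist())
```

An empty result for every ground-truth image would rule this cause out. Any reported value suggests the labels need to be converted to train IDs first.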