ifnspaml/SGDepth

RuntimeError: CUDA error: an illegal memory access was encountered

Closed this issue · 3 comments

Any idea why this might be happening?

`Starting initialization
Downloading: "https://download.pytorch.org/models/resnet18-5c106cde.pth" to /root/.cache/torch/checkpoints/resnet18-5c106cde.pth
Loading training dataset metadata:
  • Can use 57874 images from the kitti (kitti_split) train set for depth training
  • Can use 2975 images from the cityscapes train set for segmentation training
Loading validation dataset metadata:
  • Can use 1159 images from the kitti (kitti_split) validation set for depth validation
  • Can use 500 images from the cityscapes validation set for segmentation validation
Summary:
  • Model name: sgdepth_chetan
  • Logging directory: /cmudi001-sgd-1/SGDepth/Checkpoints/sgdepth_chetan_test/sgdepth_chetan
  • Using device: cuda (GeForce GTX 1080 Ti)
100%|██████████| 46827520/46827520 [00:00<00:00, 57113495.00it/s]
Epoch 0:
  • kitti_kitti_train_depth losses at epoch 0 (batch 0):
    • avg 0.1202
  • cityscapes_train_seg losses at epoch 0 (batch 0):
    • cross_entropy: 2.9558
  • Breakdown of time spent this epoch:
    • unaccounted: 0.000
    • loading: 9.133
    • optimizer: 0.000
    • transfer: 0.075
    • forward: 3.802
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1556653114079/work/aten/src/THC/THCCachingHostAllocator.cpp line=265 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "train.py", line 372, in <module>
    trainer.train()
  File "train.py", line 340, in train
    self._run_epoch()
  File "train.py", line 266, in _run_epoch
    loss.backward()
  File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered`

I don't usually see errors like this, but CUDA errors are often cryptic. Did you change anything in the code relative to the standard configuration?

Also, have you tried running the code on a CPU, just to verify that it runs at all? An error raised on the CPU is often more readable than its GPU counterpart. The code automatically detects whether a GPU is available and falls back to the CPU if not.
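A minimal sketch of how both checks could be scripted, assuming the training entry point is `train.py` as in the traceback. The helper names `debug_env` and `run_training` are hypothetical and not part of SGDepth. `CUDA_LAUNCH_BLOCKING=1` makes CUDA kernel launches synchronous, so the Python traceback should point at the operation that actually failed, and an empty `CUDA_VISIBLE_DEVICES` hides all GPUs so the automatic fallback to the CPU should kick in.

```python
import os
import subprocess
import sys


def debug_env(force_cpu=False, base_env=None):
    """Return a copy of the environment with CUDA debugging settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    # Synchronous kernel launches: errors surface at the failing call
    # instead of at some later, unrelated operation.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    if force_cpu:
        # Hide every GPU from the process so it has to run on the CPU.
        env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def run_training(script="train.py", extra_args=(), force_cpu=False):
    """Run the training script (assumed name) in a subprocess with the debug environment."""
    cmd = [sys.executable, script, *extra_args]
    return subprocess.run(cmd, env=debug_env(force_cpu)).returncode
```

Calling `run_training()` reruns on the GPU with synchronous launches, and `run_training(force_cpu=True)` repeats the run on the CPU. Synchronous launches slow training down noticeably, so this is only meant for debugging.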

I did not change anything in the code except for setting the correct paths to the data.

I tried running on the CPU and it trains without any errors. For now, I will keep this issue open and keep debugging. Thank you.

After trying some of the solutions suggested on the PyTorch discussion forum, the error has changed to this:

`Traceback (most recent call last):
  File "train.py", line 372, in <module>
    trainer.train()
  File "train.py", line 340, in train
    self._run_epoch()
  File "train.py", line 249, in _run_epoch
    outputs = model(batch)
  File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/cmudi001-sgd-1/SGDepth/models/sgdepth.py", line 292, in forward
    x = self.seg(*x)
  File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/cmudi001-sgd-1/SGDepth/models/sgdepth.py", line 79, in forward
    x = self.decoder(*x)
  File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/cmudi001-sgd-1/SGDepth/models/networks/partial_decoder.py", line 134, in forward
    x = self.blocks[f'step_{step}'](*x)
  File "/opt/conda/envs/torch_110/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/cmudi001-sgd-1/SGDepth/models/networks/partial_decoder.py", line 69, in forward
    x_new = torch.cat((x_new, x_skp), 1)
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/THC/THCCachingHostAllocator.cpp:278`

UPDATE: It seems the error occurs in the segmentation part. Could you please elaborate on how to prepare the Cityscapes dataset?
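One check that may help narrow this down: a common cause of CUDA failures in a segmentation branch is a label value that the loss cannot index. The sketch below scans a label map for such values. It assumes 19 training classes and an ignore value of 255, which is the usual Cityscapes train-ID convention. Whether SGDepth expects exactly that encoding is not confirmed here, and the function name `out_of_range_labels` is hypothetical.

```python
def out_of_range_labels(label_map, num_classes, ignore_index=255):
    """Return the sorted label values that are neither valid class IDs nor the ignore value."""
    bad = set()
    for row in label_map:
        for value in row:
            if value != ignore_index and not (0 <= value < num_classes):
                bad.add(value)
    return sorted(bad)


# Hypothetical 2x3 label map: 0..18 are valid train IDs, 255 is ignored,
# and 33 stands for a raw Cityscapes label ID that was never remapped.
example = [[0, 18, 255], [7, 33, 1]]
print(out_of_range_labels(example, num_classes=19))  # [33]
```

If a check like this reports values above 18 for the loaded segmentation targets, the label images are probably still in the raw label-ID encoding rather than train IDs.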