RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Question

McC0dy opened this issue 5 years ago · 2 comments

When training with any of the example configurations from the documentation, I get the error:
"RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED"

Reproducing
For example running:
python main.py --network_type rnn --dataset wikitext

My system configuration
CUDA 10.1
Python 3.7.3
PyTorch 1.1.0
Arch Linux
GPU: RTX 2070

Other PyTorch applications work just fine.

Full output (from pipenv environment):

% python main.py --network_type rnn --dataset wikitext
2019-06-14 16:30:31,585:INFO::[*] Make directories : logs/wikitext_2019-06-14_16-30-31
2019-06-14 16:30:49,909:INFO::regularizing:
2019-06-14 16:30:54,743:INFO::# of parameters: 169,315,278
2019-06-14 16:30:54,834:INFO::[*] MODEL dir: logs/wikitext_2019-06-14_16-30-31
2019-06-14 16:30:54,834:INFO::[*] PARAM path: logs/wikitext_2019-06-14_16-30-31/params.json
Traceback (most recent call last):
  File "main.py", line 54, in <module>
    main(args)
  File "main.py", line 34, in main
    trnr.train()
  File "/home/oliver/code/ENAS-pytorch/trainer.py", line 222, in train
    self.train_shared(dag=dag)
  File "/home/oliver/code/ENAS-pytorch/trainer.py", line 305, in train_shared
    dags)
  File "/home/oliver/code/ENAS-pytorch/trainer.py", line 251, in get_loss
    output, hidden, extra_out = self.shared(inputs, dag, hidden=hidden)
  File "/home/oliver/.local/share/virtualenvs/ENAS-pytorch-kjHs_kjH/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/oliver/code/ENAS-pytorch/models/shared_rnn.py", line 235, in forward
    logit, hidden = self.cell(x_t, hidden, dag)
  File "/home/oliver/code/ENAS-pytorch/models/shared_rnn.py", line 354, in cell
    output = self.batch_norm(output)
  File "/home/oliver/.local/share/virtualenvs/ENAS-pytorch-kjHs_kjH/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/oliver/.local/share/virtualenvs/ENAS-pytorch-kjHs_kjH/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 83, in forward
    exponential_average_factor, self.eps)
  File "/home/oliver/.local/share/virtualenvs/ENAS-pytorch-kjHs_kjH/lib/python3.7/site-packages/torch/nn/functional.py", line 1697, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Debugging
Inspecting the arguments passed to batch_norm, I found that input, weight, bias, running_mean, and running_var are all on the CUDA device, which is as expected.
The remaining arguments look reasonable as well.
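
For anyone who wants to repeat the device check, here is a minimal sketch of how it could be done. It is not code from the repository: `report_devices` and `all_on_same_device` are hypothetical helpers, and they only assume that each object exposes a `.device` attribute, as PyTorch tensors do. Entries that are `None` are skipped.

```python
def report_devices(named_tensors):
    """Map each name to its device string, skipping entries that are None.

    Hypothetical helper: works with any object exposing a `.device`
    attribute, such as a torch.Tensor.
    """
    return {
        name: str(tensor.device)
        for name, tensor in named_tensors.items()
        if tensor is not None
    }


def all_on_same_device(named_tensors):
    """Return True if every non-None entry reports the same device."""
    return len(set(report_devices(named_tensors).values())) <= 1
```

Assuming self.batch_norm is a standard torch.nn batch-norm module, this could be called just before the failing line in shared_rnn.py with a dict such as {"input": output, "weight": self.batch_norm.weight, "running_mean": self.batch_norm.running_mean}. If everything is on the same device, as it was here, two further checks may help narrow things down: running with the environment variable CUDA_LAUNCH_BLOCKING=1 so the traceback points at the kernel that actually failed, and setting torch.backends.cudnn.enabled = False to see whether the error is specific to cuDNN.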

Answer 1 · 2019-06-18T22:12:41.000Z

I had the same problem. Most NAS-related GitHub repositories target PyTorch 0.3.1, and sometimes 0.2. I suggest trying a downgrade.

Answer 2 · 2019-06-18T22:35:40.000Z

I think you should use v0.3.1 (links), which was released on Feb 13, 2018, because my initial commit was on Feb 14, 2018.
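
A quick way to confirm whether the installed PyTorch is newer than the release suggested here is a small version comparison. This is only a sketch: `parse_version`, `is_newer_than_suggested`, and `SUGGESTED_VERSION` are hypothetical names, not part of the repository or of PyTorch, and the parser only looks at the leading numeric components of the version string.

```python
import re

# Assumption: 0.3.1 is the release suggested in this thread.
SUGGESTED_VERSION = (0, 3, 1)


def parse_version(version):
    """Turn a string like '1.1.0' or '0.3.1.post2' into a tuple of three ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    # A missing patch component (e.g. '0.2') is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def is_newer_than_suggested(version, suggested=SUGGESTED_VERSION):
    """Return True if `version` is strictly newer than the suggested release."""
    return parse_version(version) > suggested
```

With PyTorch installed, calling is_newer_than_suggested(torch.__version__) would return True for the 1.1.0 setup in the question, which is consistent with the suggestion to downgrade.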