RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
McC0dy opened this issue · 2 comments
When training with any of the example configurations from the documentation, I get the following error:
"RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED"
Reproducing
For example, running:
python main.py --network_type rnn --dataset wikitext
My system configuration
CUDA 10.1
Python 3.7.3
PyTorch 1.1.0
Arch Linux
GPU: RTX 2070
Other PyTorch applications work just fine.
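The configuration above does not list the cuDNN version, which can matter for a cuDNN execution failure. A minimal sketch for collecting it along with the other details, assuming PyTorch is importable in the same pipenv environment (the helper name `collect_env` is hypothetical, not part of this repository):

```python
import platform


def collect_env():
    """Gather version details that are useful in a cuDNN bug report."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment; report that and stop.
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    # CUDA version this PyTorch build was compiled against (None for CPU builds).
    info["cuda_build"] = torch.version.cuda
    # cuDNN version PyTorch sees (expected to be None when cuDNN is unavailable).
    info["cudnn"] = torch.backends.cudnn.version()
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Note that `cuda_build` is the CUDA version PyTorch was built with, which may differ from the system-wide CUDA 10.1 listed above.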
Full output (from pipenv environment):
% python main.py --network_type rnn --dataset wikitext
2019-06-14 16:30:31,585:INFO::[*] Make directories : logs/wikitext_2019-06-14_16-30-31
2019-06-14 16:30:49,909:INFO::regularizing:
2019-06-14 16:30:54,743:INFO::# of parameters: 169,315,278
2019-06-14 16:30:54,834:INFO::[*] MODEL dir: logs/wikitext_2019-06-14_16-30-31
2019-06-14 16:30:54,834:INFO::[*] PARAM path: logs/wikitext_2019-06-14_16-30-31/params.json
Traceback (most recent call last):
  File "main.py", line 54, in <module>
    main(args)
  File "main.py", line 34, in main
    trnr.train()
  File "/home/oliver/code/ENAS-pytorch/trainer.py", line 222, in train
    self.train_shared(dag=dag)
  File "/home/oliver/code/ENAS-pytorch/trainer.py", line 305, in train_shared
    dags)
  File "/home/oliver/code/ENAS-pytorch/trainer.py", line 251, in get_loss
    output, hidden, extra_out = self.shared(inputs, dag, hidden=hidden)
  File "/home/oliver/.local/share/virtualenvs/ENAS-pytorch-kjHs_kjH/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/oliver/code/ENAS-pytorch/models/shared_rnn.py", line 235, in forward
    logit, hidden = self.cell(x_t, hidden, dag)
  File "/home/oliver/code/ENAS-pytorch/models/shared_rnn.py", line 354, in cell
    output = self.batch_norm(output)
  File "/home/oliver/.local/share/virtualenvs/ENAS-pytorch-kjHs_kjH/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/oliver/.local/share/virtualenvs/ENAS-pytorch-kjHs_kjH/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 83, in forward
    exponential_average_factor, self.eps)
  File "/home/oliver/.local/share/virtualenvs/ENAS-pytorch-kjHs_kjH/lib/python3.7/site-packages/torch/nn/functional.py", line 1697, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Debugging
Inspecting the parameters passed to batch_norm, I found that input, weight, bias, running_mean, and running_var are all on the CUDA device, which is as expected.
The remaining arguments look reasonable as well.
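The device check described above could be scripted roughly as follows. This is only a sketch: `describe_tensors` and `devices_disagree` are hypothetical helper names, and the code assumes each object exposes `.device`, `.dtype`, and `.shape` the way a `torch.Tensor` does.

```python
def describe_tensors(named):
    """Return one (name, device, dtype, shape) row per tensor-like object.

    `named` maps a label to a tensor, or to None for optional arguments
    that were not supplied (for example a batch norm without affine weights).
    """
    rows = []
    for name, tensor in named.items():
        if tensor is None:
            rows.append((name, None, None, None))
            continue
        rows.append(
            (name, str(tensor.device), str(tensor.dtype), tuple(tensor.shape))
        )
    return rows


def devices_disagree(rows):
    """True if the non-None entries live on more than one device."""
    devices = {device for _, device, _, _ in rows if device is not None}
    return len(devices) > 1
```

Assuming `self.batch_norm` is a standard PyTorch batch norm module, it could be called just before `self.batch_norm(output)` in `cell` with `output`, `self.batch_norm.weight`, `self.batch_norm.bias`, `self.batch_norm.running_mean`, and `self.batch_norm.running_var`. If everything is on the same device, temporarily setting `torch.backends.cudnn.enabled = False` may help show whether the failure is specific to the cuDNN code path.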
I had the same problem. The PyTorch version most widely used in NAS-related GitHub repositories is 0.3.1, sometimes 0.2. I suggest trying a downgrade.