Error: RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
souro opened this issue · 7 comments
I am running the following command:
python inference.py --config yelp_config.json --checkpoint working_dir/model.40.ckpt
and getting the following error:
UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.2 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
2021-05-15 13:29:46,985 - INFO - MODEL HAS 9181445 params
Load from working_dir/model.40.ckpt successful!
Traceback (most recent call last):
File "inference.py", line 103, in
model = model.cuda()
File "/lnet/spec/work/people/mukherjee/research/venvs/env_del_ret_gen/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in cuda
return self._apply(lambda t: t.cuda(device))
File "/lnet/spec/work/people/mukherjee/research/venvs/env_del_ret_gen/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/lnet/spec/work/people/mukherjee/research/venvs/env_del_ret_gen/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/lnet/spec/work/people/mukherjee/research/venvs/env_del_ret_gen/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
self.flatten_parameters()
File "/lnet/spec/work/people/mukherjee/research/venvs/env_del_ret_gen/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
No idea why this error occurs, because my other Python GPU projects work perfectly on this machine. Please let me know if you can figure out anything from this. Thank you.
Hmm, yeah, it seems like this is a GPU error. Can you give me the output of nvidia-smi? What versions of CUDA & PyTorch are you using?
CUDA version details: Cuda compilation tools, release 10.1, V10.1.105
PyTorch version details: 1.1.0
*** I have used only your provided requirements.txt
Hmm I wasn't able to reproduce this error. What is your GPU?
Can you give me the output of these commands?
nvidia-smi
python -c 'import torch; print(torch.cuda.is_available()); print(torch.__version__)'
I'd also try upgrading your PyTorch beyond what's in the requirements.txt.
I have the same trouble.
The output of the commands
- nvidia-smi
- python -c 'import torch; print(torch.cuda.is_available()); print(torch.__version__)'
is:
Sat Jun 5 20:14:59 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04 Driver Version: 450.102.04 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 Off | N/A |
| 24% 34C P8 17W / 250W | 22MiB / 11018MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1183 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 1694 G /usr/bin/gnome-shell 8MiB |
+-----------------------------------------------------------------------------+
True
1.1.0
Maybe this is because the PyTorch version is 1.1.0, which is compatible with cudatoolkit 9.0/10.0, but my device's CUDA version is 10.2?
Hello, I think I may have solved this problem.
First, I installed from the requirements.txt and hit that same error.
Then I uninstalled the pip packages and reinstalled through conda:
pip uninstall torch torchvision
conda install pytorch==1.1.0 torchvision==0.3.0 cudatoolkit=10.0 -c pytorch
Finally, python inference.py --config yelp_config.json ran successfully.
Excellent!! I will update the FAQ to reflect your fix.