Error: RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
souro opened this issue · 7 comments
I am running the following command:
python inference.py --config yelp_config.json --checkpoint working_dir/model.40.ckpt
and getting the following error:
UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.2 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
2021-05-15 13:29:46,985 - INFO - MODEL HAS 9181445 params
Load from working_dir/model.40.ckpt successful!
Traceback (most recent call last):
File "inference.py", line 103, in
model = model.cuda()
File "/lnet/spec/work/people/mukherjee/research/venvs/env_del_ret_gen/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in cuda
return self._apply(lambda t: t.cuda(device))
File "/lnet/spec/work/people/mukherjee/research/venvs/env_del_ret_gen/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/lnet/spec/work/people/mukherjee/research/venvs/env_del_ret_gen/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/lnet/spec/work/people/mukherjee/research/venvs/env_del_ret_gen/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
self.flatten_parameters()
File "/lnet/spec/work/people/mukherjee/research/venvs/env_del_ret_gen/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
No idea why this error occurs, because my other Python GPU projects work perfectly on this machine. Please let me know if you can figure out anything from this. Thank you.
Hmm, yeah, it seems like this is a GPU error. Can you give me the output of nvidia-smi? What versions of CUDA & PyTorch are you using?
CUDA version details: Cuda compilation tools, release 10.1, V10.1.105
PyTorch version details: 1.1.0
*** I have used only your provided requirements.txt
Hmm I wasn't able to reproduce this error. What is your GPU?
Can you give me the output of these commands?
nvidia-smi
python -c 'import torch; print(torch.cuda.is_available()); print(torch.__version__)'
I'd also try upgrading your PyTorch beyond what's in the requirements.txt.
I have the same trouble.
The output of the commands
- nvidia-smi
- python -c 'import torch; print(torch.cuda.is_available()); print(torch.__version__)'
is:
Sat Jun 5 20:14:59 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04 Driver Version: 450.102.04 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 Off | N/A |
| 24% 34C P8 17W / 250W | 22MiB / 11018MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1183 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 1694 G /usr/bin/gnome-shell 8MiB |
+-----------------------------------------------------------------------------+
True
1.1.0
Maybe this is because the PyTorch version is 1.1.0, which is compatible with cudatoolkit 9.0/10.0, but my device's CUDA version is 10.2?
Hello, I think I may have solved this problem.
First, I installed from the requirements.txt and hit that same error.
Then I uninstalled the pip packages and reinstalled through conda:
pip uninstall torch torchvision
conda install pytorch==1.1.0 torchvision==0.3.0 cudatoolkit=10.0 -c pytorch
Finally, python inference.py --config yelp_config.json ran successfully.
Excellent!! I will update the FAQ to reflect your fix.