Issue with trainer; PyTorch doesn't seem to be working/using GPU
mountain-valley opened this issue · 4 comments
We are running tmrl on Linux with an Nvidia RTX 3090 GPU. The server and worker seem to run just fine. However, after a few minutes of running, the trainer outputs the following error:
/home/trackmania-rl/env-linux/lib/python3.10/site-packages/torch/nn/modules/conv.py:456: UserWarning: Attempt to open cnn_infer failed: handle=0 error: /home/trackmania-rl/env-linux/lib/python3.10/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8: undefined symbol: _Z22cudnnGenericOpTensorNdILi2EE13cudnnStatus_tP12cudnnContext16cudnnGenericOp_t21cudnnNanPropagation_tPK21cudnnActivationStructPKvPK17cudnnTensorStructS9_S9_SC_S9_S9_SC_Pv, version libcudnn_ops_infer.so.8 (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:78.)
return F.conv2d(input, weight, bias, self.stride,
Traceback (most recent call last):
File "/home/trackmania-rl/env-linux/custom_actor_module.py", line 860, in <module>
my_trainer.run_with_wandb(entity=wandb_entity,
File "/home/trackmania-rl/env-linux/lib/python3.10/site-packages/tmrl/networking.py", line 419, in run_with_wandb
run_with_wandb(entity=entity,
File "/home/trackmania-rl/env-linux/lib/python3.10/site-packages/tmrl/networking.py", line 317, in run_with_wandb
for stats in iterate_epochs_tm(run_cls, interface, checkpoint_path, dump_run_instance_fn, load_run_instance_fn, 1, updater_fn):
File "/home/trackmania-rl/env-linux/lib/python3.10/site-packages/tmrl/networking.py", line 270, in iterate_epochs_tm
yield run_instance.run_epoch(interface=interface) # yield stats data frame (this makes this function a generator)
File "/home/trackmania-rl/env-linux/lib/python3.10/site-packages/tmrl/training_offline.py", line 127, in run_epoch
stats_training_dict = self.agent.train(batch)
File "/home/trackmania-rl/env-linux/custom_actor_module.py", line 737, in train
pi, logp_pi = self.model.actor(obs=o, test=False, compute_logprob=True)
File "/home/trackmania-rl/env-linux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/trackmania-rl/env-linux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/trackmania-rl/env-linux/custom_actor_module.py", line 511, in forward
net_out = self.net(obs)
File "/home/trackmania-rl/env-linux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/trackmania-rl/env-linux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/trackmania-rl/env-linux/custom_actor_module.py", line 355, in forward
x = F.relu(self.conv1(images))
File "/home/trackmania-rl/env-linux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/trackmania-rl/env-linux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/trackmania-rl/env-linux/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/trackmania-rl/env-linux/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: GET was unable to find an engine to execute this computation
What we have tried:
- ensured both nvidia-smi and nvcc -V report the same CUDA version, 12.3
- updated CUDA
- installed cuDNN
- ran inside a virtual environment (venv)
- ran inside a conda environment
- ran outside of any environment
- uninstalled and reinstalled PyTorch
Any suggestions?
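For reference, here is a minimal check (standard PyTorch calls, nothing tmrl-specific) that shows which CUDA and cuDNN builds the installed PyTorch wheel itself reports, independently of the system-wide toolkit:

```python
import torch

# Versions the installed PyTorch wheel itself reports; these can differ
# from what nvidia-smi / nvcc show for the system-wide toolkit.
print("torch:", torch.__version__)
print("torch CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```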
Ouch, I have never encountered this issue, sorry, and I run the Trainer on Linux (Ubuntu) all the time.
On the tmrl side, the error happens at line 737 of your script, when executing a forward pass:
pi, logp_pi = self.model.actor(obs=o, test=False, compute_logprob=True)
in a Conv2d layer.
This SO thread and this post seem to suggest a cuDNN/CUDA/PyTorch compatibility issue on your system.
Perhaps ask on the PyTorch forum? ptrblck is the superhero there for this type of error.
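If you do post there, the environment report they usually ask for can be generated with PyTorch's built-in collect_env utility, e.g.:

```python
# Standard PyTorch utility; prints torch/CUDA/cuDNN versions, driver and OS info.
from torch.utils.collect_env import get_pretty_env_info

print(get_pretty_env_info())
```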
Btw, do you have the same issue when running the default example pipeline for TrackMania rather than this custom_actor_module.py script?
I haven't tried the TMRL default stuff since I upgraded to a 3090, but I have no issues running most of the TMRL pipeline on another project with my 3090 (and I have tried multiple setups with different Python and CUDA versions).
Unfortunately, I have done everything on Windows.
I would try downgrading CUDA to 11.8.
Not sure if it is relevant, but this post said:
That’s right. You would need to use a properly installed NVIDIA driver, but don’t need a locally installed CUDA toolkit or cuDNN, since these are shipped as dependencies in the PyTorch binaries. Your locally installed CUDA toolkit (including cuDNN) would be used if you build PyTorch from source or…
https://discuss.pytorch.org/t/runtimeerror-get-was-unable-to-find-an-engine-to-execute-this-computation/193625
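Not tmrl-specific either, but a quick way to check whether cuDNN itself is the culprit is to run a bare conv2d on the GPU with and without cuDNN (a rough sketch, standard PyTorch flags only):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 64, device="cuda")
w = torch.randn(8, 3, 3, 3, device="cuda")

# cuDNN path: this is where "GET was unable to find an engine" is raised above.
torch.backends.cudnn.enabled = True
try:
    F.conv2d(x, w)
    print("conv2d with cuDNN: OK")
except RuntimeError as e:
    print("conv2d with cuDNN failed:", e)

# Fallback path without cuDNN (PyTorch's own CUDA kernels).
torch.backends.cudnn.enabled = False
F.conv2d(x, w)
print("conv2d without cuDNN: OK")
```

If the call without cuDNN works while the cuDNN one fails with the same engine error, the mismatch is in the cuDNN libraries the PyTorch wheel is loading, not in the tmrl script.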
(BTW, it is very hard to read that error message without line breaks in the right places.)
Closing for inactivity / most likely not a tmrl issue.