Cannot open libcuda.so.1
tbekolay opened this issue · 15 comments
Hello, I'm coming here from tensorflow/tensorflow#52988, in which @ngam recommended a conda install of tensorflow to avoid having to modify LD_LIBRARY_PATH. However, after installing I am still getting this output when trying to list the available GPUs:
$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2022-07-07 15:53:15.984325: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-07-07 15:53:15.984343: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-07-07 15:53:15.984355: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: bekolay
2022-07-07 15:53:15.984358: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: bekolay
2022-07-07 15:53:15.984388: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
2022-07-07 15:53:15.984409: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.91.3
[]
The relevant parts of conda list are:
$ conda list
cudatoolkit 11.2.2 hbe64b41_10 conda-forge
cudnn 8.2.1.32 h86fa8c9_0 conda-forge
...
keras 2.8.0 pyhd8ed1ab_0 conda-forge
...
libtensorflow 2.8.1 cpu_haf14b92_0 conda-forge
libtensorflow_cc 2.8.1 cpu_haf14b92_0 conda-forge
...
tensorflow 2.8.1 cuda112py310he87a039_0 conda-forge
tensorflow-base 2.8.1 cuda112py310h666ff7d_0 conda-forge
tensorflow-estimator 2.8.1 cuda112py310h2fa73eb_0 conda-forge
tensorflow-gpu 2.8.1 cuda112py310h0bbbad9_0 conda-forge
I'm confused about how you have libtensorflow cpu and tensorflow cuda installed in the same env.
Maybe we made a mistake in the recipe. It seems we don't pin the subpackages tightly enough. Can you try installing the cuda112 version of those two packages?
I am getting a solving error trying to do so. I haven't made any changes since the last post.
$ mamba install 'libtensorflow==2.8.1=*cuda112*' 'libtensorflow_cc==2.8.1=*cuda112*' cudatoolkit==11.2
mamba (0.24.0) supported by @QuantStack
Looking for: ['libtensorflow==2.8.1[build=*cuda112*]', 'libtensorflow_cc==2.8.1[build=*cuda112*]', 'cudatoolkit==11.2']
conda-forge/linux-64 Using cache
conda-forge/noarch Using cache
anaconda/linux-64 Using cache
anaconda/noarch Using cache
pkgs/main/linux-64 No change
pkgs/r/noarch No change
pkgs/main/noarch No change
pkgs/r/linux-64 No change
Pinned packages:
- python 3.10.*
Encountered problems while solving:
- nothing provides __cuda needed by libtensorflow-2.8.1-cuda112h918e9ab_0
- nothing provides __cuda needed by libtensorflow_cc-2.8.1-cuda112h918e9ab_0
I'm currently trying this with conda instead of mamba, and it's doing something but taking a while to do it ...
Please create a new environment and share the full output of both conda info and conda list from it.
I created a new environment but still no luck. This might be an issue with using Python 3.10?
$ mamba create -n test python=3.10
...
$ mamba activate test
$ mamba install 'tensorflow==2.8.1=*cuda112*'
...
Looking for: ['tensorflow==2.8.1[build=*cuda112*]']
conda-forge/linux-64 Using cache
conda-forge/noarch Using cache
anaconda/linux-64 Using cache
anaconda/noarch Using cache
pkgs/r/linux-64 No change
pkgs/r/noarch No change
pkgs/main/noarch No change
pkgs/main/linux-64 No change
Pinned packages:
- python 3.10.*
Encountered problems while solving:
- nothing provides __cuda needed by tensorflow-2.8.1-cuda112py310he87a039_0
Here's the requested information
$ conda info
active environment : test
active env location : /home/tbekolay/Apps/mambaforge/envs/test
shell level : 1
user config file : /home/tbekolay/.condarc
populated config files : /home/tbekolay/Apps/mambaforge/.condarc
/home/tbekolay/.condarc
conda version : 4.12.0
conda-build version : not installed
python version : 3.9.13.final.0
virtual packages : __linux=5.10.0=0
__glibc=2.31=0
__unix=0=0
__archspec=1=x86_64
base environment : /home/tbekolay/Apps/mambaforge (writable)
conda av data dir : /home/tbekolay/Apps/mambaforge/etc/conda
conda av metadata url : None
channel URLs : https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
https://conda.anaconda.org/anaconda/linux-64
https://conda.anaconda.org/anaconda/noarch
https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /home/tbekolay/Apps/mambaforge/pkgs
/home/tbekolay/.conda/pkgs
envs directories : /home/tbekolay/Apps/mambaforge/envs
/home/tbekolay/.conda/envs
platform : linux-64
user-agent : conda/4.12.0 requests/2.27.1 CPython/3.9.13 Linux/5.10.0-15-amd64 debian/11 glibc/2.31
UID:GID : 1000:1000
netrc file : None
offline mode : False
$ conda list
# packages in environment at /home/tbekolay/Apps/mambaforge/envs/test:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
ca-certificates 2022.6.15 ha878542_0 conda-forge
ld_impl_linux-64 2.36.1 hea4e1c9_2 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 12.1.0 h8d9b700_16 conda-forge
libgomp 12.1.0 h8d9b700_16 conda-forge
libnsl 2.0.0 h7f98852_0 conda-forge
libuuid 2.32.1 h7f98852_1000 conda-forge
libzlib 1.2.12 h166bdaf_1 conda-forge
ncurses 6.3 h27087fc_1 conda-forge
openssl 3.0.5 h166bdaf_0 conda-forge
pip 22.1.2 pyhd8ed1ab_0 conda-forge
python 3.10.5 ha86cf86_0_cpython conda-forge
python_abi 3.10 2_cp310 conda-forge
readline 8.1.2 h0f457ee_0 conda-forge
setuptools 63.1.0 py310hff52083_0 conda-forge
sqlite 3.39.0 h4ff8645_0 conda-forge
tk 8.6.12 h27826a3_0 conda-forge
tzdata 2022a h191b570_0 conda-forge
wheel 0.37.1 pyhd8ed1ab_0 conda-forge
xz 5.2.5 h516909a_1 conda-forge
zlib 1.2.12 h166bdaf_1 conda-forge
After reading https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-virtual.html I passed in CONDA_OVERRIDE_CUDA=11.2 to get around the __cuda issue, but this causes mamba to try to install cudatoolkit==11.7.0. After manually installing what I believe to be the right packages, I'm now in this state:
$ mamba list
...
cudatoolkit 11.2.2 hbe64b41_10 conda-forge
cudnn 8.2.1.32 h86fa8c9_0 conda-forge
...
keras 2.8.0 pyhd8ed1ab_0 conda-forge
...
libtensorflow 2.8.1 cuda112h918e9ab_0 conda-forge
libtensorflow_cc 2.8.1 cuda112h918e9ab_0 conda-forge
...
tensorflow 2.8.1 cuda112py310he87a039_0 conda-forge
tensorflow-base 2.8.1 cuda112py310h666ff7d_0 conda-forge
tensorflow-estimator 2.8.1 cuda112py310h2fa73eb_0 conda-forge
tensorflow-gpu 2.8.1 cuda112py310h0bbbad9_0 conda-forge
However, the issue remains:
$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2022-07-08 10:40:44.550238: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-07-08 10:40:44.550254: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-07-08 10:40:44.550266: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: bekolay
2022-07-08 10:40:44.550271: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: bekolay
2022-07-08 10:40:44.550303: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
2022-07-08 10:40:44.550321: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.91.3
[]
How did you install cuda? It seems conda can't detect it.
> After reading https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-virtual.html I passed in CONDA_OVERRIDE_CUDA=11.2 to get around the __cuda issue, but this causes mamba to try to install cudatoolkit==11.7.0. After manually installing what I believe to be the right packages, I'm now in this state:
This means you don't have a GPU on this machine, or the GPU isn't configured properly. Otherwise, you wouldn't need to pass CONDA_OVERRIDE_CUDA=11.2.
Can you share the output of nvidia-smi?
There are a few issues at play, but the most important one is that we need to see if anything can see your GPU (if you have one). Otherwise, the warnings above make sense to me.
Also, please rely on our configuration, and don't worry about how upstream has it set up exactly (they do that mostly for testing). There is wide and safe interoperability between cudatoolkit versions above 11.2, so you can safely get tensorflow==2.8.1=*cuda112* along with cudatoolkit 11.7.0 (or 11.7.1 if we have it already) and the latest cudnn (8.4.1.50). So for all intents and purposes, if you want to use the GPU version of tensorflow here on a new-ish NVIDIA GPU, you can simply do mamba install tensorflow==*=*cuda112* and that should get you the best configuration without any further steps (i.e. with cudatoolkit 11.7.x+ and cudnn 8.4.1.x+).
The setup you list above should certainly work, unless there is something really wrong in your GPU setup that we cannot see...
$ mamba list
...
cudatoolkit 11.2.2 hbe64b41_10 conda-forge
cudnn 8.2.1.32 h86fa8c9_0 conda-forge
...
keras 2.8.0 pyhd8ed1ab_0 conda-forge
...
libtensorflow 2.8.1 cuda112h918e9ab_0 conda-forge
libtensorflow_cc 2.8.1 cuda112h918e9ab_0 conda-forge
...
tensorflow 2.8.1 cuda112py310he87a039_0 conda-forge
tensorflow-base 2.8.1 cuda112py310h666ff7d_0 conda-forge
tensorflow-estimator 2.8.1 cuda112py310h2fa73eb_0 conda-forge
tensorflow-gpu 2.8.1 cuda112py310h0bbbad9_0 conda-forge
If nvidia-smi works and shows a GPU, the only other thing I can think of is that you have your drivers somewhere unusual, or maybe you're in a container somehow and you need a different base image? Otherwise, this is deeply puzzling to me...
I did not have nvidia-smi installed; installing it shows the GPU as expected:
$ sudo nvidia-smi
Fri Jul 8 12:59:42 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 00000000:2D:00.0 On | N/A |
| 29% 68C P0 77W / 250W | 414MiB / 12209MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 811 G /usr/lib/xorg/Xorg 276MiB |
| 0 N/A N/A 2862 G ...045272266784723622,131072 133MiB |
+-----------------------------------------------------------------------------+
The N/A for CUDA version is likely because I haven't installed any CUDA libraries system-wide, but that's intentional as I intend to use the cuda toolkit from conda. What is conda using to populate the __cuda virtual package? Is nvidia-smi required during certain install steps even though it is not really necessary to run things on the GPU?
> Also, please rely on our configuration --- and don't worry about how upstreams have it set up exactly (they do that mostly for testing). There is a wide and safe interoperability between cudatoolkits versions above 11.2, so you can safely get tensorflow==2.8.1=*cuda112* along with cudatoolkit 11.7.0 (or 11.7.1 if we have it already) with the latest cudnn (8.4.1.50).
That's great to know as it's one of the particularly annoying parts of installing TF through other methods :)
> If nvidia-smi works and shows a GPU, the only other thing I can think of is that you have your drivers somewhere unusual or maybe you're in a container somehow and you need a different base image?
I think I'm on a pretty predictable setup: Debian 11 (Bullseye) with nvidia-driver installed through apt. Definitely not in a container.
I tried making a new environment now that nvidia-smi is installed, and used the recommended CUDA 11.7 and cuDNN 8.4, but got the same result. Something perhaps useful to know is that there is indeed no libcuda.so on my system:
$ locate libcuda.so
Though other libraries are present:
$ locate libcudnn.so
/home/tbekolay/Apps/mambaforge/envs/apps/lib/libcudnn.so
/home/tbekolay/Apps/mambaforge/envs/apps/lib/libcudnn.so.8
/home/tbekolay/Apps/mambaforge/envs/apps/lib/libcudnn.so.8.4.1
/home/tbekolay/Apps/mambaforge/pkgs/cudnn-8.2.1.32-h86fa8c9_0/lib/libcudnn.so
/home/tbekolay/Apps/mambaforge/pkgs/cudnn-8.2.1.32-h86fa8c9_0/lib/libcudnn.so.8
/home/tbekolay/Apps/mambaforge/pkgs/cudnn-8.2.1.32-h86fa8c9_0/lib/libcudnn.so.8.2.1
/home/tbekolay/Apps/mambaforge/pkgs/cudnn-8.4.1.50-hed8a83a_0/lib/libcudnn.so
/home/tbekolay/Apps/mambaforge/pkgs/cudnn-8.4.1.50-hed8a83a_0/lib/libcudnn.so.8
/home/tbekolay/Apps/mambaforge/pkgs/cudnn-8.4.1.50-hed8a83a_0/lib/libcudnn.so.8.4.1
Does conda expect me to also have CUDA system libraries installed?
> Does conda expect me to also have CUDA system libraries installed?
I believe conda expects to find libcuda.so.1 in /usr/lib/, but I could be wrong; let's wait for hmaarrfk to give you a more accurate answer. Let me look into the docs to see if we have anything on this...
Could you see if this (somewhat outdated) section answers your question? You simply need a recent enough NVIDIA driver.
https://docs.anaconda.com/anaconda/user-guide/tasks/gpu-packages/#software-requirements
You can get a driver from here: https://www.nvidia.com/drivers
(marking TODO: we need to add something like this to our own conda-forge docs, preferably under this tips&tricks entry: https://conda-forge.org/docs/user/tipsandtricks.html#installing-cuda-enabled-packages-like-tensorflow-and-pytorch and here https://conda-forge.org/blog/posts/2021-11-03-tensorflow-gpu/)
libcuda.so cannot be distributed via conda-forge because NVIDIA’s EULA does not allow redistribution. One needs to install it manually.
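For anyone who wants to check this on their own machine before installing anything, here is a minimal Python sketch. It tries to load libcuda.so.1 with ctypes and, if that succeeds, asks the driver which CUDA version it supports through cuInit and cuDriverGetVersion. The function names are made up for illustration, and it is only my understanding (not confirmed in this thread) that conda populates __cuda with a similar driver probe rather than by calling nvidia-smi.

```python
import ctypes
from typing import Optional, Tuple


def decode_cuda_version(raw: int) -> Tuple[int, int]:
    """Split the integer reported by cuDriverGetVersion (e.g. 11020) into (major, minor)."""
    return raw // 1000, (raw % 1000) // 10


def driver_cuda_version(soname: str = "libcuda.so.1") -> Optional[Tuple[int, int]]:
    """Return the CUDA version supported by the installed driver, or None.

    None means the driver library could not be loaded or initialised,
    which is the situation TensorFlow reports as
    "Could not load dynamic library 'libcuda.so.1'".
    """
    try:
        libcuda = ctypes.CDLL(soname)
    except OSError:
        # Library is not installed or not on the loader search path.
        return None
    if libcuda.cuInit(0) != 0:
        # Library loaded, but the driver could not be initialised.
        return None
    raw = ctypes.c_int(0)
    if libcuda.cuDriverGetVersion(ctypes.byref(raw)) != 0:
        return None
    return decode_cuda_version(raw.value)


if __name__ == "__main__":
    version = driver_cuda_version()
    if version is None:
        print("libcuda.so.1 could not be loaded or initialised")
    else:
        print("driver supports CUDA %d.%d" % version)
```

If this reports that the library could not be loaded, no conda package will fix it; the driver's user-space library has to come from the system.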
It is all working now! I had to sudo apt install libcuda1 to get libcuda.so. Then when I restarted, conda info showed the __cuda virtual package, so I made a new environment and just installed the tf 2.8.1 package, which was sufficient to get access to the GPU through a python interpreter.
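To confirm the same thing without reading the conda info output by eye, something like the following sketch could work. It assumes that conda info --json exposes the virtual packages under a "virtual_pkgs" key as [name, version, build] lists; the helper names are hypothetical, and the sample data mirrors the virtual packages shown in the conda info output earlier in this thread.

```python
import json
import shutil
import subprocess
from typing import Optional


def find_virtual_package(info: dict, name: str = "__cuda") -> Optional[str]:
    """Return the version of a virtual package from parsed `conda info --json`, or None."""
    # Assumption: entries look like ["__cuda", "11.2", "0"].
    for entry in info.get("virtual_pkgs", []):
        if entry and entry[0] == name:
            return entry[1]
    return None


def conda_info() -> dict:
    """Run `conda info --json`; returns an empty dict if conda is not on PATH."""
    exe = shutil.which("conda")
    if exe is None:
        return {}
    out = subprocess.run([exe, "info", "--json"],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)


if __name__ == "__main__":
    # Virtual packages as listed earlier in this thread, before libcuda1 was installed.
    before = {"virtual_pkgs": [["__linux", "5.10.0", "0"],
                               ["__glibc", "2.31", "0"],
                               ["__unix", "0", "0"],
                               ["__archspec", "1", "x86_64"]]}
    print("__cuda before:", find_virtual_package(before))
```

On a real machine, find_virtual_package(conda_info()) returning None would suggest the solver will still fail with "nothing provides __cuda".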
> libcuda.so cannot be distributed via conda-forge because NVIDIA’s EULA does not allow redistribution. One needs to install it manually.
I do definitely remember at some point that I was able to get CUDA-enabled TensorFlow working without manually installing libcuda1, but it's also possible that some other system package depended on it and so it was installed even though I didn't manually install it. It is worth noting that Debian is able to package up libcuda.so through some presumably legal means, so perhaps there is some way to do it.
In any case, this is all working for me now, and likely will for other people experiencing this issue if they install the system CUDA library. Thanks everyone for the help!