conda-forge/tensorflow-feedstock

Cannot open libcuda.so.1

tbekolay opened this issue · 15 comments

Hello, I'm coming here from tensorflow/tensorflow#52988, in which @ngam recommended a conda install of tensorflow as a way around having to modify LD_LIBRARY_PATH. However, after installing, I still get this output when trying to list the available GPUs:

$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2022-07-07 15:53:15.984325: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-07-07 15:53:15.984343: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-07-07 15:53:15.984355: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: bekolay
2022-07-07 15:53:15.984358: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: bekolay
2022-07-07 15:53:15.984388: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
2022-07-07 15:53:15.984409: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.91.3
[]

The relevant parts of conda list are

$ conda list
cudatoolkit               11.2.2              hbe64b41_10    conda-forge
cudnn                     8.2.1.32             h86fa8c9_0    conda-forge
...
keras                     2.8.0              pyhd8ed1ab_0    conda-forge
...
libtensorflow             2.8.1            cpu_haf14b92_0    conda-forge
libtensorflow_cc          2.8.1            cpu_haf14b92_0    conda-forge
...
tensorflow                2.8.1           cuda112py310he87a039_0    conda-forge
tensorflow-base           2.8.1           cuda112py310h666ff7d_0    conda-forge
tensorflow-estimator      2.8.1           cuda112py310h2fa73eb_0    conda-forge
tensorflow-gpu            2.8.1           cuda112py310h0bbbad9_0    conda-forge

I'm confused how you have the CPU build of libtensorflow and the CUDA build of tensorflow installed in the same env.

Maybe we made a mistake in the recipe; it seems we don't pin the subpackages tightly enough. Can you try installing the cuda112 builds of those two packages?

I'm getting a solver error when trying to do so. I haven't made any changes since the last post.

$ mamba install 'libtensorflow==2.8.1=*cuda112*' 'libtensorflow_cc==2.8.1=*cuda112*' cudatoolkit==11.2

Looking for: ['libtensorflow==2.8.1[build=*cuda112*]', 'libtensorflow_cc==2.8.1[build=*cuda112*]', 'cudatoolkit==11.2']

conda-forge/linux-64                                        Using cache
conda-forge/noarch                                          Using cache
anaconda/linux-64                                           Using cache
anaconda/noarch                                             Using cache
pkgs/main/linux-64                                            No change
pkgs/r/noarch                                                 No change
pkgs/main/noarch                                              No change
pkgs/r/linux-64                                               No change

Pinned packages:
  - python 3.10.*


Encountered problems while solving:
  - nothing provides __cuda needed by libtensorflow-2.8.1-cuda112h918e9ab_0
  - nothing provides __cuda needed by libtensorflow_cc-2.8.1-cuda112h918e9ab_0

I'm currently trying the same thing with conda instead of mamba; it's doing something, but it's taking a while ...

Please try creating a new environment, and then share the full requested information:

conda info
conda list


I created a new environment but still no luck. Could this be an issue with using Python 3.10?

$ mamba create -n test python=3.10
...
$ mamba activate test
$ mamba install 'tensorflow==2.8.1=*cuda112*'
...
Looking for: ['tensorflow==2.8.1[build=*cuda112*]']

conda-forge/linux-64                                        Using cache
conda-forge/noarch                                          Using cache
anaconda/linux-64                                           Using cache
anaconda/noarch                                             Using cache
pkgs/r/linux-64                                               No change
pkgs/r/noarch                                                 No change
pkgs/main/noarch                                              No change
pkgs/main/linux-64                                            No change

Pinned packages:
  - python 3.10.*


Encountered problems while solving:
  - nothing provides __cuda needed by tensorflow-2.8.1-cuda112py310he87a039_0

Here's the requested information:

$ conda info
     active environment : test
    active env location : /home/tbekolay/Apps/mambaforge/envs/test
            shell level : 1
       user config file : /home/tbekolay/.condarc
 populated config files : /home/tbekolay/Apps/mambaforge/.condarc
                          /home/tbekolay/.condarc
          conda version : 4.12.0
    conda-build version : not installed
         python version : 3.9.13.final.0
       virtual packages : __linux=5.10.0=0
                          __glibc=2.31=0
                          __unix=0=0
                          __archspec=1=x86_64
       base environment : /home/tbekolay/Apps/mambaforge  (writable)
      conda av data dir : /home/tbekolay/Apps/mambaforge/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://conda.anaconda.org/anaconda/linux-64
                          https://conda.anaconda.org/anaconda/noarch
                          https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /home/tbekolay/Apps/mambaforge/pkgs
                          /home/tbekolay/.conda/pkgs
       envs directories : /home/tbekolay/Apps/mambaforge/envs
                          /home/tbekolay/.conda/envs
               platform : linux-64
             user-agent : conda/4.12.0 requests/2.27.1 CPython/3.9.13 Linux/5.10.0-15-amd64 debian/11 glibc/2.31
                UID:GID : 1000:1000
             netrc file : None
           offline mode : False

$ conda list
# packages in environment at /home/tbekolay/Apps/mambaforge/envs/test:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2022.6.15            ha878542_0    conda-forge
ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 12.1.0              h8d9b700_16    conda-forge
libgomp                   12.1.0              h8d9b700_16    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libuuid                   2.32.1            h7f98852_1000    conda-forge
libzlib                   1.2.12               h166bdaf_1    conda-forge
ncurses                   6.3                  h27087fc_1    conda-forge
openssl                   3.0.5                h166bdaf_0    conda-forge
pip                       22.1.2             pyhd8ed1ab_0    conda-forge
python                    3.10.5          ha86cf86_0_cpython    conda-forge
python_abi                3.10                    2_cp310    conda-forge
readline                  8.1.2                h0f457ee_0    conda-forge
setuptools                63.1.0          py310hff52083_0    conda-forge
sqlite                    3.39.0               h4ff8645_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
tzdata                    2022a                h191b570_0    conda-forge
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.12               h166bdaf_1    conda-forge

After reading https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-virtual.html, I passed CONDA_OVERRIDE_CUDA=11.2 to get around the __cuda issue, but that made mamba try to install cudatoolkit==11.7.0. After manually installing what I believe to be the right packages (a sketch of the command follows the package list below), I'm now in this state:

$ mamba list
...
cudatoolkit               11.2.2              hbe64b41_10    conda-forge
cudnn                     8.2.1.32             h86fa8c9_0    conda-forge
...
keras                     2.8.0              pyhd8ed1ab_0    conda-forge
...
libtensorflow             2.8.1           cuda112h918e9ab_0    conda-forge
libtensorflow_cc          2.8.1           cuda112h918e9ab_0    conda-forge
...
tensorflow                2.8.1           cuda112py310he87a039_0    conda-forge
tensorflow-base           2.8.1           cuda112py310h666ff7d_0    conda-forge
tensorflow-estimator      2.8.1           cuda112py310h2fa73eb_0    conda-forge
tensorflow-gpu            2.8.1           cuda112py310h0bbbad9_0    conda-forge
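
(For reference, the override is just an environment variable set on the install command. Roughly the command I ended up running, with explicit pins to keep the solver on CUDA 11.2; the exact pins may have differed:)

$ CONDA_OVERRIDE_CUDA="11.2" mamba install 'tensorflow==2.8.1=*cuda112*' 'cudatoolkit=11.2.*' 'cudnn=8.2.*'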

However, the issue remains:

$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2022-07-08 10:40:44.550238: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-07-08 10:40:44.550254: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-07-08 10:40:44.550266: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: bekolay
2022-07-08 10:40:44.550271: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: bekolay
2022-07-08 10:40:44.550303: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
2022-07-08 10:40:44.550321: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.91.3
[]

How did you install CUDA? It seems conda can't detect it.

ngam commented

After reading https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-virtual.html, I passed CONDA_OVERRIDE_CUDA=11.2 to get around the __cuda issue, but that made mamba try to install cudatoolkit==11.7.0. After manually installing what I believe to be the right packages, I'm now in this state:

This means you don't have a GPU on this machine, or the GPU isn't configured properly. Otherwise, you wouldn't need to pass CONDA_OVERRIDE_CUDA=11.2.

ngam commented

Can you share the output of nvidia-smi?

ngam commented

There are a few issues at play, but the most important one is that we need to see if anything can see your GPU (if you have one). Otherwise, the warnings above make sense to me.

Also, please rely on our configuration --- don't worry about how upstream has it set up exactly (they do that mostly for testing). There is wide and safe interoperability among cudatoolkit versions from 11.2 up, so you can safely use tensorflow==2.8.1=*cuda112* along with cudatoolkit 11.7.0 (or 11.7.1 if we have it already) and the latest cudnn (8.4.1.50). So, for all intents and purposes, if you want the GPU build of tensorflow here on a newish NVIDIA GPU, you can simply run mamba install 'tensorflow=*=*cuda112*' and that should get you the ideal configuration without any further steps (i.e. with cudatoolkit 11.7.x+ and cudnn 8.4.1.x+).
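
As a concrete sketch in a fresh environment (the name gpu-test is just an example; the quotes keep the shell from expanding the asterisks):

$ mamba create -n gpu-test python=3.10 'tensorflow=*=*cuda112*'
$ mamba activate gpu-test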

The setup you list above should certainly work, unless there is something really wrong in your GPU setup that we cannot see...

$ mamba list
...
cudatoolkit               11.2.2              hbe64b41_10    conda-forge
cudnn                     8.2.1.32             h86fa8c9_0    conda-forge
...
keras                     2.8.0              pyhd8ed1ab_0    conda-forge
...
libtensorflow             2.8.1           cuda112h918e9ab_0    conda-forge
libtensorflow_cc          2.8.1           cuda112h918e9ab_0    conda-forge
...
tensorflow                2.8.1           cuda112py310he87a039_0    conda-forge
tensorflow-base           2.8.1           cuda112py310h666ff7d_0    conda-forge
tensorflow-estimator      2.8.1           cuda112py310h2fa73eb_0    conda-forge
tensorflow-gpu            2.8.1           cuda112py310h0bbbad9_0    conda-forge

ngam commented

If nvidia-smi works and shows a GPU, the only other things I can think of are that your drivers are somewhere unusual, or that you're in a container somehow and need a different base image. Otherwise, this is deeply puzzling to me...

I did not have nvidia-smi installed; installing it shows the GPU as expected:

$ sudo nvidia-smi
Fri Jul  8 12:59:42 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:2D:00.0  On |                  N/A |
| 29%   68C    P0    77W / 250W |    414MiB / 12209MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       811      G   /usr/lib/xorg/Xorg                276MiB |
|    0   N/A  N/A      2862      G   ...045272266784723622,131072      133MiB |
+-----------------------------------------------------------------------------+

The N/A for the CUDA version is likely because I haven't installed any CUDA libraries system-wide, but that's intentional, as I intend to use the CUDA toolkit from conda. What is conda using to populate the __cuda virtual package? Is nvidia-smi required during certain install steps even though it isn't really necessary to run things on the GPU?

Also, please rely on our configuration --- don't worry about how upstream has it set up exactly (they do that mostly for testing). There is wide and safe interoperability among cudatoolkit versions from 11.2 up, so you can safely use tensorflow==2.8.1=*cuda112* along with cudatoolkit 11.7.0 (or 11.7.1 if we have it already) and the latest cudnn (8.4.1.50).

That's great to know as it's one of the particularly annoying parts of installing TF through other methods :)

If nvidia-smi works and shows a GPU, the only other things I can think of are that your drivers are somewhere unusual, or that you're in a container somehow and need a different base image.

I think I'm on a pretty standard setup: Debian 11 (Bullseye) with nvidia-driver installed through apt. Definitely not in a container.

I tried making a new environment now that nvidia-smi is installed, and used the recommended CUDA 11.7 and cuDNN 8.4 but got the same result. Something perhaps useful to know is that there is indeed no libcuda.so on my system:

$ locate libcuda.so
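
locate finds nothing. As a complementary check (in case the locate database was stale), the dynamic linker cache also has no entry for it:

$ ldconfig -p | grep libcuda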

Though other libraries are present:

$ locate libcudnn.so
/home/tbekolay/Apps/mambaforge/envs/apps/lib/libcudnn.so
/home/tbekolay/Apps/mambaforge/envs/apps/lib/libcudnn.so.8
/home/tbekolay/Apps/mambaforge/envs/apps/lib/libcudnn.so.8.4.1
/home/tbekolay/Apps/mambaforge/pkgs/cudnn-8.2.1.32-h86fa8c9_0/lib/libcudnn.so
/home/tbekolay/Apps/mambaforge/pkgs/cudnn-8.2.1.32-h86fa8c9_0/lib/libcudnn.so.8
/home/tbekolay/Apps/mambaforge/pkgs/cudnn-8.2.1.32-h86fa8c9_0/lib/libcudnn.so.8.2.1
/home/tbekolay/Apps/mambaforge/pkgs/cudnn-8.4.1.50-hed8a83a_0/lib/libcudnn.so
/home/tbekolay/Apps/mambaforge/pkgs/cudnn-8.4.1.50-hed8a83a_0/lib/libcudnn.so.8
/home/tbekolay/Apps/mambaforge/pkgs/cudnn-8.4.1.50-hed8a83a_0/lib/libcudnn.so.8.4.1

Does conda expect me to also have CUDA system libraries installed?

ngam commented

Does conda expect me to also have CUDA system libraries installed?

I believe conda expects to find libcuda.so.1 in /usr/lib/, but I could be wrong; let's wait for hmaarrfk to give you a more accurate answer. Let me look into the docs to see if we have anything on this...

ngam commented

Could you see if this (somewhat outdated) section answers your question? You simply need a recent enough NVIDIA driver.

https://docs.anaconda.com/anaconda/user-guide/tasks/gpu-packages/#software-requirements

You can get a driver from here: https://www.nvidia.com/drivers

(marking TODO: we need to add something like this to our own conda-forge docs, preferably under this tips&tricks entry: https://conda-forge.org/docs/user/tipsandtricks.html#installing-cuda-enabled-packages-like-tensorflow-and-pytorch and here https://conda-forge.org/blog/posts/2021-11-03-tensorflow-gpu/)

njzjz commented

libcuda.so cannot be distributed via conda-forge because NVIDIA’s EULA does not allow redistribution. One needs to install it manually.

ngam commented

@tbekolay just following up, all good or still not working?

It is all working now! I had to

sudo apt install libcuda1

to get libcuda.so. Then when I restarted, conda info showed the __cuda virtual package, so I made a new environment and just installed the tensorflow 2.8.1 package, which was sufficient to get access to the GPU from a Python interpreter.
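
For anyone else hitting this, once the driver library is in place the virtual package shows up in conda info (output abbreviated below; the version will match whatever your driver supports, and the 460.xx driver here maps to 11.2):

$ conda info
...
       virtual packages : __cuda=11.2=0
                          __linux=5.10.0=0
                          ...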

libcuda.so cannot be distributed via conda-forge because NVIDIA’s EULA does not allow redistribution. One needs to install it manually.

I definitely remember being able to get CUDA-enabled TensorFlow working at some point without manually installing libcuda1, but it's also possible that some other system package depended on it and pulled it in without me installing it explicitly. It is worth noting that Debian manages to package libcuda.so through some presumably legal means, so perhaps there is a way to do it.

In any case, this is all working for me now, and likely will for other people experiencing this issue if they install the system CUDA library. Thanks everyone for the help!