Installed with CUDA support but doesn't detect GPU at runtime
Closed this issue · 4 comments
I have installed bitsandbytes with CUDA support, but I receive an error saying that bitsandbytes was not compiled for GPU. The following output confuses me: while it detects the CUDA runtime, it complains about missing GPU support.
$ conda list | grep bitsandbytes
bitsandbytes 0.44.1 cuda120_py310hdc26961_1 conda-forge
$ conda list | grep torch
ffmpeg 4.3 hf484d3e_0 pytorch
libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
pytorch 2.5.1 py3.10_cuda12.1_cudnn9.1.0_0 pytorch
pytorch-cuda 12.1 ha16c6d3_6 pytorch
pytorch-mutex 1.0 cuda pytorch
torchaudio 2.5.1 py310_cu121 pytorch
torchtriton 3.1.0 py310 pytorch
torchvision 0.20.1 py310_cu121 pytorch
$ which nvcc
/beegfs/apps/generic/cuda-12.1/bin/nvcc
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
$ python -m bitsandbytes
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
CUDA specs: None
Torch says CUDA is not available. Possible reasons:
1. CUDA driver not installed
2. CUDA not installed
3. You have multiple conflicting CUDA libraries
The directory listed in your path is found to be non-existent: /apps/generic/compiler-2024/software/linux-rhel8-x86_64_v3/gcc-8.5.0/gcc-11.3.0-ifl4t3krdzkcrmejbgj5reljeokuv3vs/bin
[... roughly 40 similar "non-existent" warnings for PATH, module-file, and library entries omitted ...]
Found duplicate CUDA runtime files (see below).
We select the PyTorch default CUDA runtime, which is 12.1,
but this might mismatch with the CUDA version that is needed for bitsandbytes.
To override this behavior set the `BNB_CUDA_VERSION=<version string, e.g. 122>` environmental variable.
For example, if you want to use the CUDA version 122,
BNB_CUDA_VERSION=122 python ...
OR set the environmental variable in your .bashrc:
export BNB_CUDA_VERSION=122
In the case of a manual override, make sure you set LD_LIBRARY_PATH, e.g.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2,
* Found CUDA runtime at: /beegfs/apps/generic/cuda-12.1/lib64/libcudart.so
* Found CUDA runtime at: /beegfs/apps/generic/cuda-12.1/lib64/libcudart.so.12.1.105
* Found CUDA runtime at: /beegfs/apps/generic/cuda-12.1/lib64/libcudart.so.12
[... the same three entries repeated three more times ...]
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Checking that the library is importable and CUDA is callable...
Traceback (most recent call last):
File "/home/mahmood/.conda/envs/llama3-train/lib/python3.10/site-packages/bitsandbytes/diagnostics/main.py", line 66, in main
sanity_check()
File "/home/mahmood/.conda/envs/llama3-train/lib/python3.10/site-packages/bitsandbytes/diagnostics/main.py", line 33, in sanity_check
p = torch.nn.Parameter(torch.rand(10, 10).cuda())
File "/home/mahmood/.conda/envs/llama3-train/lib/python3.10/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
Above we output some debug information.
Please provide this info when creating an issue via https://github.com/TimDettmers/bitsandbytes/issues/new/choose
WARNING: Please be sure to sanitize sensitive info from the output before posting it.
How can I fix that?
Let's start by verifying that your PyTorch installation is working on GPU correctly. Can you share the output of the following?
python -c 'import torch;print(torch.__config__.show())'
We should hopefully see that PyTorch is built with CUDA support, but I'm expecting it may be missing here. If that is the case, I would suggest reinstalling PyTorch.
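For anyone debugging the same symptoms, a minimal sketch along these lines can distinguish the two failure modes seen in this thread: a CPU-only PyTorch build versus a missing NVIDIA driver (the `torch_module` parameter is only there for illustration; normally you would just pass `torch`):

```python
# Minimal sketch: separate "torch built without CUDA" from "no driver/GPU visible".
def cuda_status(torch_module):
    """Summarize CUDA availability for a torch-like module."""
    return {
        # torch.version.cuda is None when the wheel was built CPU-only
        "built_with_cuda": torch_module.version.cuda is not None,
        # torch.cuda.is_available() is False when no NVIDIA driver/GPU
        # is visible -- e.g. when running on a cluster login node
        "driver_available": torch_module.cuda.is_available(),
    }

if __name__ == "__main__":
    try:
        import torch
        print(cuda_status(torch))
    except ImportError:
        print("PyTorch is not installed in this environment")
```

If `built_with_cuda` is True but `driver_available` is False, the install is fine and the problem is the machine you are running on.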
Hello, I installed bitsandbytes==0.42.0 as you described, but at runtime it reports that bitsandbytes does not support CUDA. When I run python -c 'import torch; print(torch.__config__.show())', CUDA shows as ON and it returns True.
@matthewdouglas I found the problem. First, as you suspected, I was not on a GPU node. On the cluster I use, I was running those commands on the login node. Although the CUDA modules were loaded, the correct output only appears on a GPU node where the driver is running. So here is the output:
$ python -c 'import torch;print(torch.__config__.show())'
PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.5.3 (Git Hash 66f0cb9eb66affd2da3bf5f8d897376f04aae6af)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.1
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 90.1 (built against CUDA 12.4)
- ...
$ python -m bitsandbytes
Could not find the bitsandbytes CUDA binary at PosixPath('/home/mnaderantahan/.conda/envs/llama3-train/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so')
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
CUDA specs: CUDASpecs(highest_compute_capability=(8, 0), cuda_version_string='121', cuda_version_tuple=(12, 1))
PyTorch settings found: CUDA_VERSION=121, Highest Compute Capability: (8, 0).
Library not found: /home/mnaderantahan/.conda/envs/llama3-train/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so. Maybe you need to compile it from source?
If you compiled from source, try again with `make CUDA_VERSION=DETECTED_CUDA_VERSION`,
for example, `make CUDA_VERSION=113`.
The CUDA version for the compile might depend on your conda install, if using conda.
Inspect CUDA version via `conda list | grep cuda`.
To manually override the PyTorch CUDA version please see: https://github.com/TimDettmers/bitsandbytes/blob/main/docs/source/nonpytorchcuda.mdx
[... the same "non-existent" PATH, module-file, and library warnings as in the first report, omitted ...]
Found duplicate CUDA runtime files (see below).
We select the PyTorch default CUDA runtime, which is 12.1,
but this might mismatch with the CUDA version that is needed for bitsandbytes.
To override this behavior set the `BNB_CUDA_VERSION=<version string, e.g. 122>` environmental variable.
For example, if you want to use the CUDA version 122,
BNB_CUDA_VERSION=122 python ...
OR set the environmental variable in your .bashrc:
export BNB_CUDA_VERSION=122
In the case of a manual override, make sure you set LD_LIBRARY_PATH, e.g.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2,
* Found CUDA runtime at: /beegfs/apps/generic/cuda-12.1/lib64/libcudart.so
* Found CUDA runtime at: /beegfs/apps/generic/cuda-12.1/lib64/libcudart.so.12.1.105
* Found CUDA runtime at: /beegfs/apps/generic/cuda-12.1/lib64/libcudart.so.12
[... the same three entries repeated three more times ...]
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
I installed bitsandbytes via conda, so I only see the CUDA 12.0 library file:
$ ls /home/mahmood/.conda/envs/llama3-train/lib/python3.10/site-packages/bitsandbytes
autograd consts.py diagnostics __init__.py libbitsandbytes_cuda120.so nn __pycache__ triton
cextension.py cuda_specs.py functional.py libbitsandbytes_cpu.so __main__.py optim research utils.py
Based on my searches, there is no conda package built for CUDA 12.1, which is the module I have loaded. So I guess I have to compile from source.
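As a rough sketch of that source build (following the bitsandbytes source-install docs of that era; run on the GPU node with the cuda/12.1 module loaded, and note that exact flags may differ between versions):

```shell
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
cmake -DCOMPUTE_BACKEND=cuda -S .   # picks up nvcc from the loaded cuda/12.1 module
make
pip install -e .                    # installs into the active conda environment
```

This produces the libbitsandbytes_cuda121.so that the diagnostics above reported as missing.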
OK, I compiled from source against the CUDA version on the GPU node and it now works fine. Thank you.
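For readers who hit the same login-node pitfall: the resolution above means the diagnostics must be run where the NVIDIA driver is present. On a Slurm cluster that can be sketched as below; the module names come from the logs in this thread, but the srun flags are hypothetical and cluster-specific.

```shell
# Request an interactive shell on a GPU node (resource flags vary per cluster)
srun --gres=gpu:1 --pty bash
# Then load the same modules and re-run the diagnostics there
module load cuda/12.1 miniconda3/4.12.0
python -m bitsandbytes
```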