After downgrading jax, the code runs on the GPU but reports "Not Enough GPU memory?"
Tiramisu023 opened this issue · 1 comment
What is your installation issue?
Hello, I ran into the "Not Enough GPU memory" problem after solving the problem of jax not recognizing the GPU device.
The steps that led to the error are as follows.
I installed LocalColabFold using "install_colabbatch_linux.sh". When I ran "colabfold_batch", the message "no GPU detected, will be using CPU" appeared. I then checked whether jax could recognize the GPU device (refer to #209).
$HOME/software/localcolabfold/colabfold-conda/bin/python3.10
# Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] on linux
>>> import jax
>>> print(jax.local_devices()[0].platform)
# CUDA backend failed to initialize: Unable to load cuDNN. Is it installed? (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
# cpu
Then I checked the jax and jaxlib versions:
COLABFOLDDIR="/public1/users/liyulong/software/localcolabfold"
"$COLABFOLDDIR/colabfold-conda/bin/pip" list | grep "jax"
# jax 0.4.23
# jaxlib 0.4.23+cuda11.cudnn86
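As a minimal sketch of how the jaxlib version string above can be read, the following helper splits it into the jaxlib release, the CUDA major version, and the cuDNN version. It assumes that a suffix such as "cudnn86" means cuDNN 8.6 (one digit for the major version); the function and class names are hypothetical and not part of jax or LocalColabFold.

```python
import re
from typing import NamedTuple, Optional


class JaxlibBuild(NamedTuple):
    version: str     # jaxlib release, e.g. "0.4.23"
    cuda_major: int  # CUDA major version the wheel targets, e.g. 11
    cudnn: str       # cuDNN version the wheel targets, e.g. "8.6"


def parse_jaxlib_version(tag: str) -> Optional[JaxlibBuild]:
    """Split a jaxlib version such as '0.4.23+cuda11.cudnn86'.

    Assumption: the cuDNN suffix is <major digit><minor digits>,
    so 'cudnn86' is read as cuDNN 8.6.
    """
    m = re.fullmatch(r"(\d+(?:\.\d+)*)\+cuda(\d+)\.cudnn(\d)(\d+)", tag.strip())
    if m is None:
        return None  # CPU-only wheel or a tag this sketch does not recognise
    version, cuda, cudnn_major, cudnn_minor = m.groups()
    return JaxlibBuild(version, int(cuda), f"{cudnn_major}.{cudnn_minor}")


for tag in ("0.4.23+cuda11.cudnn86", "0.4.7+cuda12.cudnn88", "0.4.23"):
    print(tag, "->", parse_jaxlib_version(tag))
```

Comparing the parsed cuDNN version with the cuDNN library that is actually installed may help when jax reports "Unable to load cuDNN".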
I downgraded jax to "jax==0.4.7, jaxlib==0.4.7+cuda11.cudnn86" (refer to #209). After that, jax could recognize the GPU device.
$HOME/software/localcolabfold/colabfold-conda/bin/python3.10
# Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] on linux
>>> import jax
>>> print(jax.local_devices()[0].platform)
gpu
Then colabfold_batch failed with "No module named 'jax.extend'" (refer to #224). I reinstalled "dm-haiku==0.0.10", and colabfold_batch could then run on the GPU device. However, I hit a new problem: "Could not predict HNUJ.ctg90.87. Not Enough GPU memory? FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details.".
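The platform check shown earlier only confirms that jax sees the GPU; it does not necessarily initialize the DNN library. A rough sketch of a smaller reproduction, assuming that a convolution on the GPU backend is the kind of operation that goes through cuDNN, is to run one tiny convolution outside colabfold_batch. The function name is hypothetical.

```python
def cudnn_smoke_test():
    """Run one small convolution and report the backend, or the error text."""
    try:
        import jax
        import jax.numpy as jnp
    except ImportError:
        return None  # jax is not installed in this interpreter

    try:
        x = jnp.ones((1, 1, 8, 8), dtype=jnp.float32)  # input, NCHW layout
        k = jnp.ones((1, 1, 3, 3), dtype=jnp.float32)  # kernel, OIHW layout
        y = jax.lax.conv(x, k, window_strides=(1, 1), padding="SAME")
        y.block_until_ready()  # wait so that any runtime error surfaces here
        return ("ok", jax.default_backend(), y.shape)
    except Exception as exc:  # e.g. a FAILED_PRECONDITION raised by the runtime
        return ("failed", type(exc).__name__, str(exc))


print(cudnn_smoke_test())
```

If this fails with the same "DNN library initialization failed" text on an otherwise idle GPU, the cause is more likely the installed libraries than the size of the input sequence.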
I have two RTX 2080 Ti GPUs (11 GB each).
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01 Driver Version: 535.113.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2080 Ti Off | 00000000:17:00.0 Off | N/A |
| 38% 41C P0 52W / 250W | 0MiB / 11264MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 2080 Ti Off | 00000000:25:00.0 Off | N/A |
| 25% 30C P0 21W / 250W | 0MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
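The table above shows both GPUs idle at 0 MiB of 11264 MiB. A minimal sketch for reading the same numbers programmatically uses the CSV query mode of nvidia-smi. The sample text below is hand-written to mirror the table, not captured output, and the function names are hypothetical.

```python
import shutil
import subprocess

# CSV query mode of nvidia-smi: one line per GPU, memory values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Turn 'index, name, used, total' lines into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, name, used, total = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "name": name,
            "used_mib": int(used),
            "free_mib": int(total) - int(used),
        })
    return rows


def query_gpu_memory():
    """Call nvidia-smi if it is on PATH; return an empty list otherwise."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)


# Hand-written sample mirroring the table above (assumed CSV layout).
sample = (
    "0, NVIDIA GeForce RTX 2080 Ti, 0, 11264\n"
    "1, NVIDIA GeForce RTX 2080 Ti, 0, 11264\n"
)
print(parse_gpu_memory(sample))
```

Calling query_gpu_memory() right before colabfold_batch starts shows whether another process already holds memory on either card.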
My FASTA file contains 450 amino acids. Is this problem caused by insufficient video memory? It seems that even 40 GB of video memory can still show this problem (refer to #90).
In addition, since I had added CUDA 12.1 to my $PATH, I also tried modifying "install_colabbatch_linux.sh" as suggested by A-Talavera (refer to #210).
I changed "$COLABFOLDDIR/colabfold-conda/bin/pip" install --upgrade "jax[cuda11_pip]==0.4.23"
to "$COLABFOLDDIR/colabfold-conda/bin/pip" install --upgrade "jax[cuda12_pip]==0.4.23"
With this change, jaxlib-0.4.23+cuda12.cudnn89 is installed by default. I then downgraded to "jaxlib-0.4.7+cuda12.cudnn88", following the same process as above. colabfold_batch runs on the GPU, but it still reports "Could not predict HNUJ.ctg90.87. Not Enough GPU memory? FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details". Also, in #224 you said that "jax-0.4.23+cuda11.cudnn86" was also OK for CUDA 12.1.
Computational environment
- OS: [e.g. Ubuntu 22.04, Windows10 & WSL2, macOS...]
$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
- CUDA version if Linux (Show the output of /usr/local/cuda/bin/nvcc --version.)
$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0
Since LocalColabFold requires CUDA 11.8+, I added CUDA 12.1 to the environment variable $PATH.
$ which nvcc
/usr/local/cuda-12.1/bin/nvcc
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
Is it because the program calls CUDA 11.3 (/usr/local/cuda/bin/nvcc) instead of CUDA 12.1 in $PATH by default?
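To compare the two toolkits discussed above, a small sketch can extract the release number from "nvcc --version" output and report which nvcc is found first on $PATH. The sample lines are copied from the outputs above; the function name is hypothetical. Note that this only shows which compiler the shell resolves, not which CUDA or cuDNN libraries jax loads at run time.

```python
import re
import shutil


def nvcc_release(version_text):
    """Return (major, minor) from 'nvcc --version' output, or None."""
    m = re.search(r"release (\d+)\.(\d+)", version_text)
    return (int(m.group(1)), int(m.group(2))) if m else None


# Lines taken from the two nvcc outputs quoted in this issue.
sample_default = "Cuda compilation tools, release 11.3, V11.3.58"
sample_on_path = "Cuda compilation tools, release 12.1, V12.1.105"

print("/usr/local/cuda/bin/nvcc ->", nvcc_release(sample_default))
print("nvcc from $PATH          ->", nvcc_release(sample_on_path))
print("nvcc resolved by this interpreter:", shutil.which("nvcc"))
```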
Looking forward to your reply. Thank you.
Yulong Li
I finally solved this "Not Enough GPU memory" problem according to the solution in #224:
pip install --upgrade nvidia-cudnn-cu11==8.5.0.96
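To confirm which versions ended up installed after the command above, a minimal sketch using the standard library can list the packages mentioned in this issue. It should be run with the interpreter inside colabfold-conda; the package list and function name are assumptions made for illustration.

```python
from importlib import metadata

# Packages mentioned in this issue (an illustrative list, not exhaustive).
PACKAGES = ("jax", "jaxlib", "dm-haiku", "nvidia-cudnn-cu11")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


for name, version in installed_versions().items():
    print(f"{name:20s} {version or 'not installed'}")
```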
This issue can be closed. Sorry for taking up your time.