bug(CUDA): Wrong CUDA version installed
Zujiry opened this issue · 4 comments
What happened:
I am using the AMI ami-0b73208d93a9261f4
In EC2, this AMI resolves to the source image amazon/amazon-eks-gpu-node-1.30-v20240928.
This version should have CUDA 12.2.2 installed.
However, if I execute nvidia-smi on the instance, the output is:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                Persistence-M  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf         Pwr:Usage/Cap  |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:1E.0 Off |                    0 |
| N/A   27C    P8              9W /  70W  |      1MiB /  15360MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
This clearly states that CUDA 12.4 is installed. As a result, TensorFlow does not work correctly, because it expects the ptxas binary (part of the CUDA toolkit) to be installed, which it is not.
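To double-check, here is a minimal sketch (plain Python, nothing specific to this AMI assumed) of how the presence of ptxas can be verified from inside the container:

```python
# Sketch: check whether the ptxas binary (shipped with the CUDA toolkit's
# compiler package) is reachable on the PATH. TensorFlow's XLA JIT needs it.
import shutil

ptxas = shutil.which("ptxas")
if ptxas:
    print("ptxas found at:", ptxas)
else:
    print("ptxas is NOT on the PATH")
```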
What you expected to happen:
The CUDA version stated in the release documentation should be installed.
How to reproduce it (as minimally and precisely as possible):
Environment:
- AWS Region: eu-central-1
- Instance Type(s): g4dn.xlarge
- Cluster Kubernetes version: 1.30
- Node Kubernetes version: 1.30
- AMI Version: ami-0b73208d93a9261f4
Tensorflow and its CUDA dependencies should be installed within your app container - are you using an official Tensorflow container image?
I have tried a sample pod with these images on my EKS cluster:
- nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 -> nvidia-smi = CUDA 12.4
- nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04 -> nvidia-smi = CUDA 12.4
- nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04 -> nvidia-smi = CUDA 12.4
- nvidia/cuda:12.6.1-cudnn8-runtime-ubuntu22.04 -> nvidia-smi = CUDA 12.6
I installed Python 3.10 and TensorFlow 2.15 (https://www.tensorflow.org/install/source#gpu) on the 12.2.2 image and tried to execute my code, which led to the error.
Could this lead to a wrong CUDA version output? I also get CUDA 12.4 on the instance itself when I run nvidia-smi.
Using the TensorFlow image actually solved the problem, even though the reported CUDA version still makes me wonder.
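For anyone checking this themselves, here is a minimal sketch (assuming TensorFlow 2.x, which exposes tf.sysconfig.get_build_info()) that prints the CUDA version the installed wheel was built against, which is the version that matters for TensorFlow rather than what nvidia-smi reports:

```python
# Sketch: show the CUDA/cuDNN versions the installed TensorFlow wheel was
# built against, plus whether TensorFlow can actually see the GPU.
import tensorflow as tf

build = tf.sysconfig.get_build_info()
print("Built against CUDA :", build.get("cuda_version"))
print("Built against cuDNN:", build.get("cudnn_version"))
print("Visible GPUs       :", tf.config.list_physical_devices("GPU"))
```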
Thank you very much @bryantbiggs
What you see when you run nvidia-smi is actually the version of libcuda.so, which is the driver-side CUDA installed and used by the NVIDIA driver. The version of CUDA that most users are interested in is libcudart.so, which is the CUDA runtime. If you look at Figure 1 here - https://docs.nvidia.com/deploy/cuda-compatibility/ - the grey box is what is provided in the EKS accelerated AMIs; the box above it is the application dependencies, developer tools, etc. that users provide in their container image.
For most of the popular ML libraries and frameworks, such as PyTorch and TensorFlow, the projects ship their CUDA dependencies inside the container images they create and supply. This includes the CUDA runtime (libcudart.so) as well as other CUDA libraries like cuBLAS, cuDNN, etc. - the CUDA-X libraries shown in this graphic: https://blogs.nvidia.com/wp-content/uploads/2012/09/cuda-apps-and-libraries.png
Hopefully that helps clear up some of the confusion around the term CUDA!
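As a rough illustration of the difference, here is a minimal sketch (using ctypes against the CUDA runtime library, and assuming a libcudart.so.12 is loadable inside the container image) that queries both versions side by side:

```python
# Sketch: compare the driver-supported CUDA version (what nvidia-smi shows,
# backed by libcuda.so from the host/AMI) with the CUDA runtime version
# (libcudart, supplied by the container image).
import ctypes

def fmt(v: int) -> str:
    # CUDA encodes versions as 1000*major + 10*minor, e.g. 12040 -> "12.4"
    return f"{v // 1000}.{(v % 1000) // 10}"

libcudart = ctypes.CDLL("libcudart.so.12")  # adjust to the runtime shipped in your image

driver_ver = ctypes.c_int()
runtime_ver = ctypes.c_int()
libcudart.cudaDriverGetVersion(ctypes.byref(driver_ver))    # driver-side, from libcuda.so
libcudart.cudaRuntimeGetVersion(ctypes.byref(runtime_ver))  # runtime-side, from libcudart

print("Driver CUDA (what nvidia-smi reports):", fmt(driver_ver.value))
print("Runtime CUDA (what your app links)   :", fmt(runtime_ver.value))
```

Running this in the nvidia/cuda:12.2.2 image on this AMI would be expected to print 12.4 for the driver and 12.2 for the runtime.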