awslabs/amazon-eks-ami

bug(CUDA): Wrong CUDA version installed

Zujiry opened this issue · 4 comments

What happened:

I am using the AMI ami-0b73208d93a9261f4.
In EC2 its source is listed as amazon/amazon-eks-gpu-node-1.30-v20240928.
According to the release documentation, this version should have CUDA 12.2.2 installed.

However, if I execute nvidia-smi on the instance, the output is:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:1E.0 Off |                    0 |
| N/A   27C    P8              9W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

This clearly states that CUDA 12.4 is installed. This causes TensorFlow to not work correctly, because it expects ptxas to be available, which it is not.
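
A quick way to confirm that ptxas is missing is to check whether it is on the PATH; a minimal sketch in Python, assuming an interpreter is available in the environment being inspected:

import shutil

# Look for the ptxas binary that TensorFlow's XLA compiler invokes at runtime.
path = shutil.which("ptxas")
print(f"ptxas found at {path}" if path else "ptxas not found on PATH")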

What you expected to happen:

The CUDA version stated in the release documentation should be installed.

How to reproduce it (as minimally and precisely as possible):

Environment:

  • AWS Region: eu-central-1
  • Instance Type(s): g4dn.xlarge
  • Cluster Kubernetes version: 1.30
  • Node Kubernetes version: 1.30
  • AMI Version: ami-0b73208d93a9261f4

Tensorflow and its CUDA dependencies should be installed within your app container - are you using an official Tensorflow container image?

I have tried with a sample pod and these images on my EKS cluster:

  • nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 -> nvidia-smi = CUDA 12.4
  • nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04 -> nvidia-smi = CUDA 12.4
  • nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04 -> nvidia-smi = CUDA 12.4
  • nvidia/cuda:12.6.1-cudnn8-runtime-ubuntu22.04 -> nvidia-smi = CUDA 12.6

I installed Python 3.10 and TensorFlow 2.15 (https://www.tensorflow.org/install/source#gpu) on the 12.2.2 image and tried to execute my code, which leads to the error described above.
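
To see which CUDA version TensorFlow itself was built against (as opposed to what nvidia-smi reports), something like the following can be run inside the container; a minimal sketch, assuming TensorFlow 2.15 is installed:

import tensorflow as tf

# CUDA/cuDNN versions this TensorFlow build was compiled against
# (the runtime side, independent of what nvidia-smi reports).
build = tf.sysconfig.get_build_info()
print("built with CUDA:", build.get("cuda_version"))
print("built with cuDNN:", build.get("cudnn_version"))

# GPUs actually visible to TensorFlow inside the pod.
print("visible GPUs:", tf.config.list_physical_devices("GPU"))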

Could this lead to a wrong CUDA version being reported?
I also get CUDA 12.4 on the instance itself when I run nvidia-smi.

Using the TensorFlow image actually solved the problem, even though the reported CUDA version still makes me wonder.
Thank you very much @bryantbiggs

What you see when you run nvidia-smi is actually the version of libcuda.so, which is the driver-side CUDA installed and used by the NVIDIA driver. The CUDA version most users are interested in is libcudart.so, the CUDA runtime. If you look at Figure 1 here - https://docs.nvidia.com/deploy/cuda-compatibility/ - the grey box is what is provided in the EKS accelerated AMIs; the box above it is the application dependencies, developer tools, etc. that users provide in their container image.
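
To make the distinction concrete, the two versions can be queried separately; a minimal sketch in Python, assuming libcuda.so.1 (from the driver on the node) and a CUDA 12 libcudart (from the container image) are both loadable - the exact sonames are assumptions and may differ per image:

import ctypes

def decode(v):
    # CUDA encodes version 12.4 as the integer 12040.
    return f"{v // 1000}.{(v % 1000) // 10}"

# Driver-side CUDA version (what nvidia-smi reports), from libcuda.so
# installed by the NVIDIA driver on the node.
drv = ctypes.c_int()
ctypes.CDLL("libcuda.so.1").cuDriverGetVersion(ctypes.byref(drv))
print("driver CUDA:", decode(drv.value))

# Runtime-side CUDA version, from libcudart supplied by the container image
# (the soname is an assumption; it may be libcudart.so or libcudart.so.12).
rt = ctypes.c_int()
ctypes.CDLL("libcudart.so.12").cudaRuntimeGetVersion(ctypes.byref(rt))
print("runtime CUDA:", decode(rt.value))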

For most of the popular ML libraries and frameworks, such as PyTorch and TensorFlow, the projects provide their CUDA dependencies inside the container images they create and publish. This includes the CUDA runtime (libcudart.so) as well as other CUDA libraries like cuBLAS, cuDNN, etc. - the CUDA-X libraries shown in this graphic: https://blogs.nvidia.com/wp-content/uploads/2012/09/cuda-apps-and-libraries.png
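
One way to see this in practice is to list the user-space CUDA libraries a given image bundles; a minimal sketch, where the search paths are assumptions based on common nvidia/cuda and framework image layouts:

import glob

# Typical locations for the CUDA runtime, cuBLAS, cuDNN, etc. inside the image;
# adjust the patterns for the image in use.
patterns = [
    "/usr/local/cuda*/lib64/lib*.so*",
    "/usr/lib/x86_64-linux-gnu/libcudnn*.so*",
]
for pattern in patterns:
    for path in sorted(glob.glob(pattern)):
        print(path)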

Hopefully that helps clear up some of the confusion around the term CUDA!