awslabs/amazon-eks-ami

Please add nvidia driver version 520 for GPU enabled EKS AMI image

jsuto opened this issue · 12 comments

jsuto commented

What would you like to be added:

NVIDIA driver version 520 and related packages are needed on a GPU-enabled EKS host.

Why is this needed:

The current EKS AMI ships NVIDIA driver version 470. However, we have software that requires a newer version. NVIDIA driver 510 seems to work for us, though it might be better to ship the latest version, 520.

Has there been any movement on this? We're using JAX, which is very particular about matching CUDA and NVIDIA driver releases (so 470 means the highest CUDA we can use is 11.4). Now that CUDA 12 is out, any chance we can get the driver version bumped?
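
For anyone checking compatibility before picking a driver, the constraint above can be sketched as a small lookup. This is only an illustration: the function name `max_cuda_for_driver` is hypothetical, and the table contains just the driver/CUDA pairings quoted in this thread (470 with 11.4, 525 with 12.0, 535 with 12.2). Check NVIDIA's release notes for any other branch.

```shell
# Sketch: highest CUDA version usable with a given NVIDIA driver branch.
# Only the pairings mentioned in this thread are listed; anything else
# is reported as unknown rather than guessed.
max_cuda_for_driver() {
  driver="$1"
  branch="${driver%%.*}"   # "535.54.03" -> "535"
  case "$branch" in
    470) echo "11.4" ;;
    525) echo "12.0" ;;
    535) echo "12.2" ;;
    *)   echo "unknown"; return 1 ;;
  esac
}

max_cuda_for_driver "470"
```

The function returns a non-zero status for branches it does not know, so it can be used directly in an `if` before attempting to install a CUDA-dependent framework.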

al1y commented

Any progress here? Tried building my own custom image but had 0 luck

I haven't tried it yet but I believe the best solution here is:

  1. Move off the Amazon AMIs completely to https://cloud-images.ubuntu.com/docs/aws/eks/
  2. Install this operator: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/amazon-eks.html which lets you use whatever version of CUDA you want. Then you can drop the device plugin too.
al1y commented

๐Ÿ‘ will give that a go - ty for response!

How'd it go? We'd very much like to use torch 2 and need 510 drivers, which unfortunately still don't seem to be supported by default.

We disabled GPU support on our EKS cluster and moved everything to GKE, where the default image is currently on driver version 525. Plus A100s come in more shapes and are cheaper there.

I tried basically following only step 2 of your suggestion and was unsuccessful (NVIDIA/gpu-operator#542). I think perhaps it is impossible to update prebuilt drivers (NVIDIA/gpu-operator#525). We are pretty ensconced in AWS, wondering if there even is a solution there... maybe running https://aws.amazon.com/marketplace/pp/prodview-h3v6xvwe36v74 as you suggest?

fwiw other AWS AMIs do run 510 driver versions, but my understanding is that these don't come with EKS support.

Okay, step 1 was clearly critical - working now. Thanks for the useful thread!

We plan to upgrade the NVIDIA drivers in our EKS Optimized Accelerated AMI to the newer 525 series with a future Kubernetes version. For customers who want to stay on older Kubernetes versions, we will also provide a way of upgrading the NVIDIA drivers with the existing Accelerated AMI via documentation.

Can you share a link to this documentation? What we're seeing on EKS 1.25 is that the NVIDIA driver version differs depending on the node being used. So I am not convinced it's related to just the EKS AMI, unless I am misunderstanding something.

If needed, you can run the following on the EKS GPU AMI to install a newer driver; just set the intended driver version:

# Versions
# Driver 525.125.06 / CUDA 12.0
# Driver 535.54.03 / CUDA 12.2

# DRIVER=525.125.06
DRIVER=535.54.03

sudo yum install gcc10 -y
sudo wget -O /tmp/NVIDIA-Linux-driver.run "https://us.download.nvidia.com/tesla/${DRIVER}/NVIDIA-Linux-x86_64-${DRIVER}.run"
sudo CC=gcc10-cc sh /tmp/NVIDIA-Linux-driver.run -q -a --ui=none
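
Once the installer finishes, a check along these lines can confirm that the driver actually loaded is at least the version you asked for. This is a minimal sketch, not part of the original instructions: `version_ge` and `EXPECTED` are names introduced here for illustration, and the comparison assumes GNU `sort -V` is available (it is on Amazon Linux 2).

```shell
# Sketch: verify that the installed driver is at least the expected version.
version_ge() {
  # True when $1 >= $2. sort -V orders dotted versions numerically,
  # so the smaller of the two ends up on the first line.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: same value as the DRIVER variable used for the install.
EXPECTED="${EXPECTED:-535.54.03}"

if command -v nvidia-smi >/dev/null 2>&1; then
  # Ask the driver for its own version; take the first GPU's answer.
  INSTALLED="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if version_ge "$INSTALLED" "$EXPECTED"; then
    echo "driver $INSTALLED is at least $EXPECTED"
  else
    echo "driver $INSTALLED is older than $EXPECTED" >&2
  fi
else
  echo "nvidia-smi not found; is the driver installed?" >&2
fi
```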

You could run the install commands above in the user data so the driver is installed during instance startup. However, this adds a bit of time to instance startup. Instead, launch a standalone EC2 instance from the EKS GPU AMI (you don't need to supply the cluster bootstrap script; it's not meant to connect to a cluster at this point), run the commands above, and then create an AMI from the instance for use in your nodegroups.
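
The image-creation step could look roughly like the following with the AWS CLI. Treat it as a sketch under assumptions: the `ami_name` helper and its naming scheme are made up for illustration, `INSTANCE_ID` must be set to the ID of the standalone instance you prepared, and the CLI needs credentials with permission to create images. Note that `aws ec2 create-image` reboots the instance by default.

```shell
# Sketch: turn the prepared standalone instance into an AMI.
ami_name() {
  # Illustrative naming scheme: eks-gpu-<driver version>-<date stamp>
  printf 'eks-gpu-%s-%s\n' "$1" "$2"
}

DRIVER="${DRIVER:-535.54.03}"
INSTANCE_ID="${INSTANCE_ID:-}"   # assumption: ID of the instance prepared above

if [ -n "$INSTANCE_ID" ]; then
  # create-image snapshots the instance's volumes and registers an AMI.
  AMI_ID="$(aws ec2 create-image \
    --instance-id "$INSTANCE_ID" \
    --name "$(ami_name "$DRIVER" "$(date +%Y%m%d)")" \
    --query 'ImageId' --output text)"
  # Block until the AMI is usable in a launch template or nodegroup.
  aws ec2 wait image-available --image-ids "$AMI_ID"
  echo "AMI ready: $AMI_ID"
else
  echo "INSTANCE_ID is not set; skipping image creation" >&2
fi
```

The resulting AMI ID can then be referenced from the launch template of a self-managed or managed nodegroup.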

โš ๏ธ This information is provided to help folks install their own drivers and devices. You should thoroughly test and validate before deploying your workload. The configuration/guidance provided is not part of an AWS service and support is provided as best-effort by the maintainers. As stated here, official EKS support for newer drivers and devices will come on a future Kubernetes version of EKS

Here is an initial Packer configuration to build an EKS AMI for use with NVIDIA GPUs (it is suitable for P5 instances as well): https://github.com/clowdhaus/amazon-eks-gpu-ami

This will be moving over to https://github.com/aws-samples/amazon-eks-custom-amis this week

โš ๏ธ This information is provided to help folks install their own drivers and devices. You should thoroughly test and validate before deploying your workload. The configuration/guidance provided is not part of an AWS service and support is provided as best-effort by the maintainers. As stated here, official EKS support for newer drivers and devices will come on a future Kubernetes version of EKS