GoogleCloudPlatform/container-engine-accelerators

Request to provide Dockerfile source code for Nvidia driver installation on COS


Would it be possible for the repo maintainers to provide the Dockerfile and any scripts used to build the image that this daemonset runs? https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/cos/daemonset-nvidia-v450.yaml

The reason for this request is that I'd like to install a specific NVIDIA driver version (470.57.02) on a GKE cluster running Container-Optimized OS with containerd. The official GKE documentation provides this daemonset, which installs an older driver version. I assume daemonset-nvidia-v450.yaml in this repo can be modified to install a specific driver by changing this line to point to an appropriate image:

      - image: gcr.io/cos-cloud/cos-gpu-installer@sha256:93f1abf0d6a27e14bebf43ffb00b8d819b20f6027012ad73306ba670bcac6c83

However, I cannot find the source code for this image, so it is not clear how to install a different NVIDIA driver version.

For GKE Ubuntu images, for example, this repo provides the Dockerfile and entrypoint.sh source code. Would it be possible to share the COS equivalent?

If you set the image to gcr.io/cos-cloud/cos-gpu-installer:latest and set the NVIDIA_DRIVER_VERSION environment variable to the driver version you want, it should work. It works for me with 470.82.01.
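
In case it helps, the change described above might look roughly like the fragment below. This is only a sketch, not the maintainers' manifest: the image tag, the NVIDIA_DRIVER_VERSION variable, and the version number come from this comment, while the container name and its placement under initContainers are assumptions, so keep whatever your copy of daemonset-nvidia-v450.yaml already uses and change only the image and env entries.

```yaml
# Sketch only: fragment of the pod spec in a copy of daemonset-nvidia-v450.yaml.
# The container name and its placement under initContainers are assumptions;
# keep the structure your copy of the manifest already has.
initContainers:
  - name: nvidia-driver-installer   # assumed name, not confirmed in this thread
    image: gcr.io/cos-cloud/cos-gpu-installer:latest
    env:
      - name: NVIDIA_DRIVER_VERSION
        value: "470.82.01"          # the version reported as working above
```

Quoting the version keeps YAML from treating it as anything other than a string.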

Also, I don't think the entrypoint and Dockerfile for Ubuntu are valid anymore. I ran the install steps from the script manually on an Ubuntu node, and they don't work.

Any updates on this?

Our cluster runs on Ubuntu nodes, and we need to know how to install a specific CUDA version on them. I tried setting the environment variable, but the install still fails.