awslabs/amazon-eks-ami

Some instance types using incorrect NVIDIA kernel module on amazon-eks-gpu-node-1.29-v20240227

Closed this issue · 16 comments

What happened:

I run a p3.2xlarge node group in my 1.29 EKS cluster. I updated the node group to AMI ami-07c8bc6b0bb890e9e (amazon-eks-gpu-node-1.29-v20240227). After the update I was unable to deploy my CUDA containers to the node. I SSH'd into the node and found that nvidia-smi couldn't communicate with the GPU:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

What you expected to happen:

Should be able to communicate with the Tesla GPU without manual intervention

How to reproduce it (as minimally and precisely as possible):

Deploy a p3.2xlarge node on a 1.29 cluster using the latest AMI image.
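For completeness, roughly how the node group gets created (a sketch, assuming a managed node group, which may not match your setup; the cluster name, role ARN, and subnets are placeholders, and the release version pins the affected AMI):

aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name p3-gpu \
  --ami-type AL2_x86_64_GPU \
  --release-version 1.29.0-20240227 \
  --instance-types p3.2xlarge \
  --node-role arn:aws:iam::111122223333:role/eksNodeRole \
  --subnets subnet-aaa subnet-bbb \
  --scaling-config minSize=1,maxSize=1,desiredSize=1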

Anything else we need to know?:

Environment:

  • AWS Region: us-east-2
  • Instance Type(s): p3.2xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.1
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.29
  • AMI Version: amazon-eks-gpu-node-1.29-v20240227
  • Kernel (e.g. uname -a): Linux ip-10-20-40-96.us-east-2.compute.internal 5.10.209-198.858.amzn2.x86_64 #1 SMP Tue Feb 13 18:46:41 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0e3ec26ca86336aea"
BUILD_TIME="Tue Feb 27 23:54:40 UTC 2024"
BUILD_KERNEL="5.10.209-198.858.amzn2.x86_64"
ARCH="x86_64"

Everything should work out of the box, but I can manually fix this by removing the default nvidia DKMS files and reinstalling the DKMS module for the NVIDIA driver version this latest AMI release purportedly supports:

sudo rm -r /var/lib/dkms/nvidia
sudo dkms install nvidia/535.161.07 --force

Then if I run nvidia-smi I get:

Fri Mar  1 04:33:36 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB           Off | 00000000:00:1E.0 Off |                    0 |
| N/A   24C    P0              38W / 300W |      0MiB / 16384MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

This instance type is being incorrectly detected as supporting the open-source NVIDIA kernel module, and the wrong kmod is loaded as a result. I have a fix out for review and it will land in the next AMI release.
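If you want to double-check which kmod variant ended up loaded on a node, something like this works (a rough sketch; the exact strings vary by driver version):

# The proprietary kmod reports license "NVIDIA", the open kmod "Dual MIT/GPL";
# /proc/driver/nvidia/version also says "Open Kernel Module" for the open variant.
modinfo -F license nvidia
cat /proc/driver/nvidia/version
dkms status | grep nvidia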

After you've force-loaded the proprietary kmod, do you see any issues with your workloads? Feel free to open an AWS Support case if you can't share the details here; I'll track it down. 👍

Thanks, @cartermckinnon. After I force-load the NVIDIA kernel module, everything appears to behave normally. I'm going to roll back to the previous AMI though so I won't have exhaustive insight into the stability of the modified image.

Same issue here.
@cartermckinnon, any info on when the new version will be released?

This issue should be fixed in https://github.com/awslabs/amazon-eks-ami/releases/tag/v20240307. What release are you using?

@cartermckinnon
amazon/amazon-eks-gpu-node-1.29-v20240307 (ami-031e889e75cb38be6).
The nvidia-device-plugin image is k8s-device-plugin:v0.14.5 and the host is a g4dn.2xlarge.

The error:

Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0314 17:11:50.667852       1 main.go:256] Retreiving plugins.
W0314 17:11:50.668199       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0314 17:11:50.668243       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0314 17:11:50.668272       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0314 17:11:50.668282       1 factory.go:115] Incompatible platform detected
E0314 17:11:50.668287       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0314 17:11:50.668291       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0314 17:11:50.668294       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0314 17:11:50.668301       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0314 17:11:50.672879       1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed

I've run into the same issue as @korjek. I'm on EKS 1.29 with AMI amazon-eks-gpu-node-1.29-v20240307.

It appears the containerd config.toml is not being updated to use the nvidia runtime. I found the configure-nvidia.service unit and its corresponding script, then tried to run the script, which gave me this output.

/etc/eks/configure-nvidia.sh
+ gpu-ami-util has-nvidia-devices
true
+ /etc/eks/nvidia-kmod-load.sh
true
0x2237 NVIDIA A10G
Disabling GSP for instance type: g5.xlarge
2024-03-15T21:59:42+0000 [kmod-util] unpacking: nvidia-open
Error! nvidia-open-535.161.07 is already added!
Aborting.

As a workaround, I patched /etc/eks/configure-nvidia.sh in my Karpenter userdata like so.

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: nvidia-a10g
spec:
  # ... bunch of other stuff
  userData: |
    cat <<EOF > /etc/eks/configure-nvidia.sh
    #!/usr/bin/env bash

    set -o errexit
    set -o nounset
    set -o xtrace

    if ! gpu-ami-util has-nvidia-devices; then
      echo >&2 "no NVIDIA devices are present, nothing to do!"
      exit 0
    fi

    # patched with "|| true" to avoid failing on startup
    /etc/eks/nvidia-kmod-load.sh || true

    # add 'nvidia' runtime to containerd config, and set it as the default
    # otherwise, all Pods need to specify the runtimeClassName
    nvidia-ctk runtime configure --runtime=containerd --set-as-default
    EOF
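After the node boots with this userdata, a quick sanity check along these lines confirms the patch took effect (a sketch; paths are the stock AL2 ones):

grep default_runtime_name /etc/containerd/config.toml   # should show "nvidia"
systemctl status configure-nvidia --no-pager
nvidia-smi -L                                            # should list the GPU(s)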

Can you grab the logs from the initial execution of the script? journalctl -u configure-nvidia

I did more testing and found that my workaround only fixes the issue by accident, and for the wrong reason.

What I think is really happening is that configure-nvidia.service completes before the bootstrap.sh process gets to this code:

  if ! cmp -s /etc/eks/containerd/containerd-config.toml /etc/containerd/config.toml; then
    sudo cp -v /etc/eks/containerd/containerd-config.toml /etc/containerd/config.toml
    sudo cp -v /etc/eks/containerd/sandbox-image.service /etc/systemd/system/sandbox-image.service
    sudo chown root:root /etc/systemd/system/sandbox-image.service
    systemctl daemon-reload
    systemctl enable containerd sandbox-image
    systemctl restart sandbox-image containerd
  fi

The configure-nvidia service sets nvidia as the default runtime in /etc/containerd/config.toml, but because it finishes before the bootstrap process, the bootstrap process overwrites that file, since it differs from /etc/eks/containerd/containerd-config.toml.
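On an affected node you can see this happen with something like the following (a rough sketch):

# After bootstrap.sh runs, the live config has been overwritten and the nvidia runtime entry is gone:
grep -q 'runtimes.nvidia' /etc/containerd/config.toml && echo "nvidia runtime present" || echo "nvidia runtime missing"
# The unit timestamps show configure-nvidia finishing before bootstrap restarts containerd:
journalctl -u configure-nvidia -u containerd --no-pager | head -n 40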

So how does my workaround "fix" this? From what I can tell, the || true actually causes the configure-nvidia service to fail after its first startup (I'm not sure why exactly). configure-nvidia is then started again (I'm not sure how, either); this happens while the sandbox-image service is being restarted by the bootstrap process. Because sandbox-image takes a while to restart, the new containerd config is in place by then, and containerd itself is restarted right after sandbox-image in the bootstrap process.

I've stitched together logs from my observations. The "current.txt" doesn't have my extra userdata.

This is probably a better workaround for now. Basically I'm taking the would-be nvidia-ctk generated containerd config (from the configure-nvidia service) and writing it to /etc/eks/containerd/containerd-config.toml knowing that the bootstrap process uses it.

Note that I'm setting discard_unpacked_layers to false for my use case, which also helps ensure the ! cmp -s ... check in bootstrap.sh runs its block of code. One caveat is the hardcoded account id for the sandbox_image, which I think would need to be updated based on these docs (see the note after the config below).

---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: nvidia-a10g
spec:
  # ...
  userData: |
    cat <<EOF > /etc/eks/containerd/containerd-config.toml
    imports = ["/etc/containerd/config.d/*.toml"]
    root = "/var/lib/containerd"
    state = "/run/containerd"
    version = 2

    [grpc]
      address = "/run/containerd/containerd.sock"

    [plugins]

      [plugins."io.containerd.grpc.v1.cri"]
        sandbox_image = "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5"

        [plugins."io.containerd.grpc.v1.cri".cni]
          bin_dir = "/opt/cni/bin"
          conf_dir = "/etc/cni/net.d"

        [plugins."io.containerd.grpc.v1.cri".containerd]
          default_runtime_name = "nvidia"
          discard_unpacked_layers = false

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
              runtime_type = "io.containerd.runc.v2"

              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
                BinaryName = "/usr/bin/nvidia-container-runtime"
                SystemdCgroup = true

            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
              runtime_type = "io.containerd.runc.v2"

              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
                SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".registry]
          config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"
    EOF
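On the sandbox_image caveat: rather than hardcoding the account id, one option is to read the region-correct value off the stock config before overwriting it, something like (a sketch):

# Capture the pause image the AMI shipped with, then substitute it into the heredoc above.
awk -F'"' '/sandbox_image/ {print $2}' /etc/eks/containerd/containerd-config.toml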

Ideally I'd be able to build a GPU AMI with a modified bootstrap.sh script, but I can't figure out where the GPU AMIs are coming from. Doesn't seem like they're open source?

We also seem to be experiencing the same issue with amazon-eks-gpu-node-1.29-v20240307. Do we know when a new AMI that addresses this will be released?

Ideally I'd be able to build a GPU AMI with a modified bootstrap.sh script, but I can't figure out where the GPU AMIs are coming from. Doesn't seem like they're open source?

They are not open source, though as far as AWS has communicated, they are built from this repo and modified internally.

Ideally I'd be able to build a GPU AMI with a modified bootstrap.sh script, but I can't figure out where the GPU AMIs are coming from. Doesn't seem like they're open source?

The GPU AMI template is not open source at the moment, but you can always use an existing GPU AMI as the base image in a Packer template if you want to apply a patched bootstrap.sh.
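For example, you can look up the latest GPU AMI for a given Kubernetes version via SSM and feed it to the Packer template as the source AMI (region and version here are just placeholders):

aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.29/amazon-linux-2-gpu/recommended/image_id \
  --region us-east-2 \
  --query 'Parameter.Value' \
  --output text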

This is probably a better workaround for now. Basically I'm taking the would-be nvidia-ctk generated containerd config (from the configure-nvidia service) and writing it to /etc/eks/containerd/containerd-config.toml knowing that the bootstrap process uses it.

Yep, this should work for now. I intend to have a proper fix out in the next AMI release.

I have hit the same issue with the latest EKS GPU AMI (amazon/amazon-eks-gpu-node-1.29-v20240329): the NVIDIA device plugin throws the following error when a GPU node is deployed with Karpenter.

I0404 16:49:05.428673       1 main.go:279] Retrieving plugins.
W0404 16:49:05.428726       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0404 16:49:05.428788       1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0404 16:49:05.428819       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0404 16:49:05.428823       1 factory.go:112] Incompatible platform detected
E0404 16:49:05.428826       1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0404 16:49:05.428829       1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0404 16:49:05.428831       1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0404 16:49:05.428834       1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0404 16:49:05.448543       1 main.go:132] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed

I have hit the same issue. The NVIDIA GPU Operator (v23.9.1) appears to function correctly with the latest GPU AMI (amazon/amazon-eks-gpu-node-1.29-v20240329).

I resolved the problem by removing the NVIDIA Device Plugin (v0.15.0-rc.2) and relying solely on the NVIDIA GPU Operator.

Hopefully, a new patch of the EKS AMI will resolve the issue with the NVIDIA device plugin.

We're also seeing the same issue; we're using v0.13.0 of the nvidia-device-plugin.

Both of the issues mentioned here (incorrect NVIDIA kmod being loaded, race condition between configure-nvidia.service and bootstrap.sh) should be resolved in the latest AMI release, v20240409. 👍
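For managed node groups, picking up the fix should just be a version bump, roughly (a sketch; cluster and node group names are placeholders):

aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name p3-gpu \
  --release-version 1.29.0-20240409

Karpenter-provisioned nodes should pick up the new AMI as they're replaced, depending on how the EC2NodeClass selects AMIs.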