Some instance types using incorrect NVIDIA kernel module on amazon-eks-gpu-node-1.29-v20240227
Closed this issue · 16 comments
What happened:
I run a p3.2xlarge node group in my 1.29 EKS cluster. I updated the node group's AMI to ami-07c8bc6b0bb890e9e (amazon-eks-gpu-node-1.29-v20240227). After the update I was unable to deploy my CUDA containers to the node. I ssh'd into the node and found that nvidia-smi couldn't communicate with the GPU:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running
What you expected to happen:
Should be able to communicate with the Tesla GPU without manual intervention
How to reproduce it (as minimally and precisely as possible):
Deploy a p3.2xlarge node on a 1.29 cluster using the latest AMI image.
Anything else we need to know?:
Environment:
- AWS Region: us-east-2
- Instance Type(s): p3.2xlarge
- EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.1
- Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.29
- AMI Version: amazon-eks-gpu-node-1.29-v20240227
- Kernel (e.g. uname -a): Linux ip-10-20-40-96.us-east-2.compute.internal 5.10.209-198.858.amzn2.x86_64 #1 SMP Tue Feb 13 18:46:41 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
- Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0e3ec26ca86336aea"
BUILD_TIME="Tue Feb 27 23:54:40 UTC 2024"
BUILD_KERNEL="5.10.209-198.858.amzn2.x86_64"
ARCH="x86_64"
Everything should work out of the box, but I can manually fix this by removing the default nvidia-dkms files and reinstalling the dkms module for the NVIDIA driver version this AMI purportedly supports:
sudo rm -r /var/lib/dkms/nvidia
sudo dkms install nvidia/535.161.07 --force
Then if I run nvidia-smi, I get:
Fri Mar 1 04:33:36 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-SXM2-16GB Off | 00000000:00:1E.0 Off | 0 |
| N/A 24C P0 38W / 300W | 0MiB / 16384MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
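In case it helps others triage, here's a quick check for which kernel module variant is actually installed. This is a sketch, assuming modinfo is available; it relies on the license strings NVIDIA ships ("Dual MIT/GPL" for nvidia-open, "NVIDIA" for the proprietary module):

```shell
#!/usr/bin/env bash
# Sketch: report which NVIDIA kernel module variant is installed, based on
# the module's license string ("Dual MIT/GPL" for nvidia-open, "NVIDIA"
# for the proprietary kmod). Prints a fallback if no module is found.
license=$(modinfo -F license nvidia 2>/dev/null || true)
case "$license" in
  *MIT*)    echo "open kernel module installed" ;;
  *NVIDIA*) echo "proprietary kernel module installed" ;;
  *)        echo "no NVIDIA kernel module found" ;;
esac
```

On an affected node this should report the open module even though the V100 needs the proprietary one.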
This instance type is being incorrectly detected as supporting the open-source NVIDIA kernel module, and the wrong kmod is loaded as a result. I have a fix out for review and it will land in the next AMI release.
After you've force-loaded the proprietary kmod, do you see any issues with your workloads? Feel free to open an AWS Support case if you can't share the details here; I'll track it down.
Thanks, @cartermckinnon. After I force-load the NVIDIA kernel module, everything appears to behave normally. I'm going to roll back to the previous AMI though so I won't have exhaustive insight into the stability of the modified image.
Same issue here.
@cartermckinnon any info on when the new version will be released?
This issue should be fixed in https://github.com/awslabs/amazon-eks-ami/releases/tag/v20240307. What release are you using?
@cartermckinnon amazon/amazon-eks-gpu-node-1.29-v20240307 (ami-031e889e75cb38be6). The nvidia-device-plugin image is k8s-device-plugin:v0.14.5 and the host is g4dn.2xlarge. The error:
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": true,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0314 17:11:50.667852 1 main.go:256] Retreiving plugins.
W0314 17:11:50.668199 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0314 17:11:50.668243 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0314 17:11:50.668272 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0314 17:11:50.668282 1 factory.go:115] Incompatible platform detected
E0314 17:11:50.668287 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0314 17:11:50.668291 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0314 17:11:50.668294 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0314 17:11:50.668301 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0314 17:11:50.672879 1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed
I've run into the same issue as @korjek. I'm on EKS 1.29 with AMI amazon-eks-gpu-node-1.29-v20240307.
It appears the containerd config.toml is not being updated to use the nvidia runtime. I found configure-nvidia.service and its corresponding script, then tried running the script, which gave me this output:
/etc/eks/configure-nvidia.sh
+ gpu-ami-util has-nvidia-devices
true
+ /etc/eks/nvidia-kmod-load.sh
true
0x2237 NVIDIA A10G
Disabling GSP for instance type: g5.xlarge
2024-03-15T21:59:42+0000 [kmod-util] unpacking: nvidia-open
Error! nvidia-open-535.161.07 is already added!
Aborting.
As a workaround, I patched /etc/eks/configure-nvidia.sh in my Karpenter userdata like so:
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: nvidia-a10g
spec:
  # ... bunch of other stuff
  userData: |
    cat <<EOF > /etc/eks/configure-nvidia.sh
    #!/usr/bin/env bash
    set -o errexit
    set -o nounset
    set -o xtrace

    if ! gpu-ami-util has-nvidia-devices; then
      echo >&2 "no NVIDIA devices are present, nothing to do!"
      exit 0
    fi

    # patched with "|| true" to avoid failing on startup
    /etc/eks/nvidia-kmod-load.sh || true

    # add 'nvidia' runtime to containerd config, and set it as the default;
    # otherwise, all Pods need to specify the runtimeClassName
    nvidia-ctk runtime configure --runtime=containerd --set-as-default
    EOF
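As an aside, the reason the || true makes any difference is the set -o errexit at the top of the script: without it, a nonzero exit from nvidia-kmod-load.sh aborts everything after it. A toy illustration (not EKS-specific):

```shell
#!/usr/bin/env bash
# Toy illustration of "|| true" under errexit: a failing command normally
# aborts the script, but appending "|| true" consumes the failure status.
set -o errexit
false || true      # without "|| true", the script would exit here
echo "still running"
```

Running it prints "still running"; remove the || true and the script exits with a nonzero status before the echo.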
Can you grab the logs from the initial execution of the script? journalctl -u configure-nvidia
I did more testing and found that my workaround fixes the issue only by accident, for the wrong reason.
What I think is really happening is that configure-nvidia.service completes before the bootstrap.sh process gets to this code:
if ! cmp -s /etc/eks/containerd/containerd-config.toml /etc/containerd/config.toml; then
sudo cp -v /etc/eks/containerd/containerd-config.toml /etc/containerd/config.toml
sudo cp -v /etc/eks/containerd/sandbox-image.service /etc/systemd/system/sandbox-image.service
sudo chown root:root /etc/systemd/system/sandbox-image.service
systemctl daemon-reload
systemctl enable containerd sandbox-image
systemctl restart sandbox-image containerd
fi
The configure-nvidia service sets nvidia as the runtime in /etc/containerd/config.toml, but because it finishes before the bootstrap process, the bootstrap process overwrites this file because it's different from /etc/eks/containerd/containerd-config.toml.
So how does my workaround "fix" this? From what I can tell, the || true actually causes the configure-nvidia service to fail after its first startup (not sure why exactly). Then configure-nvidia is started again (not sure how, either); this happens while the sandbox-image service is being restarted by the bootstrap process. Because the sandbox-image service takes a while to restart, the new containerd config is in place by then, and the containerd service is restarted right after sandbox-image in the bootstrap process.
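The overwrite itself is easy to reproduce in isolation. This toy script stands in for the bootstrap.sh block quoted above, with temp files in place of the real paths (file contents here are illustrative only):

```shell
#!/usr/bin/env bash
# Toy reproduction of the race's effect: a "live" config edited by the
# configure-nvidia service gets clobbered because it no longer matches
# the staged copy that bootstrap.sh compares against.
set -euo pipefail
staged=$(mktemp)   # stands in for /etc/eks/containerd/containerd-config.toml
live=$(mktemp)     # stands in for /etc/containerd/config.toml

echo 'default_runtime_name = "runc"'   > "$staged"
echo 'default_runtime_name = "nvidia"' > "$live"    # configure-nvidia's edit

# bootstrap.sh logic: if the files differ, the staged copy wins
if ! cmp -s "$staged" "$live"; then
  cp "$staged" "$live"
fi

grep 'default_runtime_name' "$live"    # prints: default_runtime_name = "runc"
rm -f "$staged" "$live"
```

The nvidia default is gone after the copy, which matches what I see on the node.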
I've stitched together logs from my observations. The "current.txt" doesn't have my extra userdata.
This is probably a better workaround for now. Basically, I'm taking the would-be nvidia-ctk-generated containerd config (from the configure-nvidia service) and writing it to /etc/eks/containerd/containerd-config.toml, knowing that the bootstrap process uses it.
Note I'm setting discard_unpacked_layers to false for my use case, which helps make sure the ! cmp -s ... check in the bootstrap.sh script runs its block of code. One caveat is the hardcoded account ID for the sandbox_image, which I think would need to be updated based on these docs.
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: nvidia-a10g
spec:
  # ...
  userData: |
    cat <<EOF > /etc/eks/containerd/containerd-config.toml
    imports = ["/etc/containerd/config.d/*.toml"]
    root = "/var/lib/containerd"
    state = "/run/containerd"
    version = 2

    [grpc]
    address = "/run/containerd/containerd.sock"

    [plugins]

    [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5"

    [plugins."io.containerd.grpc.v1.cri".cni]
    bin_dir = "/opt/cni/bin"
    conf_dir = "/etc/cni/net.d"

    [plugins."io.containerd.grpc.v1.cri".containerd]
    default_runtime_name = "nvidia"
    discard_unpacked_layers = false

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
    SystemdCgroup = true

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true

    [plugins."io.containerd.grpc.v1.cri".registry]
    config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"
    EOF
Ideally I'd be able to build a GPU AMI with a modified bootstrap.sh script, but I can't figure out where the GPU AMIs are coming from. Doesn't seem like they're open source?
We also seem to be experiencing the same issue with amazon-eks-gpu-node-1.29-v20240307. Do we know when a new AMI that addresses this will be released?
Ideally I'd be able to build a GPU AMI with a modified bootstrap.sh script, but I can't figure out where the GPU AMIs are coming from. Doesn't seem like they're open source?
They are not open source, though they are built off this repo and modified by AWS internally as far as they have communicated.
Ideally I'd be able to build a GPU AMI with a modified bootstrap.sh script, but I can't figure out where the GPU AMIs are coming from. Doesn't seem like they're open source?
The GPU AMI template is not open source at the moment, but you can always use an existing GPU AMI as a base image in a Packer template if you want to apply a patched bootstrap.sh.
This is probably a better workaround for now. Basically I'm taking the would-be nvidia-ctk generated containerd config (from the configure-nvidia service) and writing it to /etc/eks/containerd/containerd-config.toml knowing that the bootstrap process uses it.
Yep, this should work for now. I intend to have a proper fix out in the next AMI release.
I have hit the same issue with the latest EKS AMI (amazon/amazon-eks-gpu-node-1.29-v20240329), where the NVIDIA device plugin throws the following error when deployed to a GPU node with Karpenter:
I0404 16:49:05.428673 1 main.go:279] Retrieving plugins.
W0404 16:49:05.428726 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0404 16:49:05.428788 1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0404 16:49:05.428819 1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0404 16:49:05.428823 1 factory.go:112] Incompatible platform detected
E0404 16:49:05.428826 1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0404 16:49:05.428829 1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0404 16:49:05.428831 1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0404 16:49:05.428834 1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0404 16:49:05.448543 1 main.go:132] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed
I have hit the same issue. The NVIDIA GPU Operator (v23.9.1) appears to function correctly with the latest GPU AMI (amazon/amazon-eks-gpu-node-1.29-v20240329). I resolved the problem by removing the NVIDIA device plugin (v0.15.0-rc.2) and relying solely on the GPU Operator.
Hopefully a new patch of the EKS AMI will resolve the issue with the NVIDIA device plugin.
We're also seeing the same issue; we're using v0.13.0 of the nvidia-device-plugin.
Both of the issues mentioned here (the incorrect NVIDIA kmod being loaded, and the race condition between configure-nvidia.service and bootstrap.sh) should be resolved in the latest AMI release, v20240409.