awslabs/amazon-eks-ami

Differences in installation results depending on the version of the aws-eks-gpu-node AMI image

Closed this issue · 1 comment

I have an EKS cluster on Kubernetes 1.25 and am testing an upgrade to 1.28.
When I launch a GPU node with the aws-eks-gpu-node-1.28 AMI, the NVIDIA driver is not installed properly.

With the aws-eks-gpu-node-1.25 AMI, the scripts in /etc/eks run as expected and the NVIDIA driver is installed.

On a node launched from the aws-eks-gpu-node-1.28 AMI, it looks like this:

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
$ journalctl -u configure-nvidia.service
~~
 Apr 04 11:12:59 localhost systemd[1]: Starting Configure NVIDIA instance types...
 Apr 04 11:12:59 localhost configure-nvidia.sh[2177]: + gpu-ami-util has-nvidia-devices
 Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: true
 Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: + /etc/eks/nvidia-kmod-load.sh
 Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: true
 Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: curl: (7) Failed to connect to 169.254.169.254 port 80 after 0
 Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: curl: (7) Failed to connect to 169.254.169.254 port 80 after 0
 Apr 04 11:13:00 localhost systemd[1]: configure-nvidia.service: main process exited, code=exited, status=1/FAILURE
 Apr 04 11:13:00 localhost systemd[1]: Failed to start Configure NVIDIA instance types.
 Apr 04 11:13:00 localhost systemd[1]: Unit configure-nvidia.service entered failed state.
 Apr 04 11:13:00 localhost systemd[1]: configure-nvidia.service failed.
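
To confirm that the unit really starts before the network is up, the declared dependencies and boot ordering can be inspected on the affected node. This is only a diagnostic sketch; the exact output depends on the boot in question.

$ systemctl show configure-nvidia.service -p After -p Wants
$ systemd-analyze critical-chain configure-nvidia.service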

The systemd unit shipped in the aws-eks-gpu-node-1.28 AMI is as follows.

$ cat /etc/systemd/system/configure-nvidia.service
[Unit]
Description=Configure NVIDIA instance types
Before=docker.service containerd.service nvidia-fabricmanager.service nvidia-persistenced.service

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/etc/eks/configure-nvidia.sh

[Install]
WantedBy=multi-user.target docker.service containerd.service


The unit shipped in the aws-eks-gpu-node-1.25 AMI is as follows.

[Unit]
Description=Configure NVIDIA instance types
# the script needs to use IMDS, so wait for the network to be up
# to avoid any flakiness due to races
After=network-online.target
Wants=network-online.target
Before=docker.service containerd.service nvidia-fabricmanager.service nvidia-persistenced.service

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/etc/eks/configure-nvidia.sh

[Install]
WantedBy=multi-user.target docker.service containerd.service

The difference between the two is that the 1.28 unit no longer declares After=network-online.target / Wants=network-online.target.
The NVIDIA driver installation then fails because the script's query to IMDS at 169.254.169.254:80 is made before the network is up.
I wonder if removing this dependency was intentional.
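
Until a fixed AMI is available, one possible workaround (untested here, shown only as a sketch) is to restore the ordering with a systemd drop-in via user data or a shell on the node, rather than editing the shipped unit:

$ sudo mkdir -p /etc/systemd/system/configure-nvidia.service.d
$ sudo tee /etc/systemd/system/configure-nvidia.service.d/10-wait-for-network.conf <<'EOF'
[Unit]
# configure-nvidia.sh queries IMDS, so wait for the network to be up,
# matching what the 1.25 unit declared
After=network-online.target
Wants=network-online.target
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl restart configure-nvidia.service

On an already-booted node the restart runs the script with the network up, and the drop-in keeps the ordering on subsequent boots.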

Environment:

  • AWS Region: ap-northeast-2
  • Instance Type(s): g4dn.xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.11
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): v1.25 and v1.28
  • AMI Version: aws-eks-gpu-node-1.25 / aws-eks-gpu-node-1.28
  • Kernel (e.g. uname -a):
    • Linux ip-172-31-13-206.ap-northeast-2.compute.internal 5.10.210-201.855.amzn2.x86_64 #1 SMP Tue Mar 12 19:03:26 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
    • BASE_AMI_ID="ami-09bffa74b1e396075"
      BUILD_TIME="Fri Feb 17 21:58:10 UTC 2023"
      BUILD_KERNEL="5.10.165-143.735.amzn2.x86_64"
      ARCH="x86_64"

This is fixed in the latest release 👍