awslabs/amazon-eks-ami

Differences in installation results depending on the version of the aws-eks-gpu-node AMI image

Closed this issue · 1 comment

I have an EKS cluster on Kubernetes 1.25 and am testing an upgrade to 1.28.
When I launch a GPU node with the aws-eks-gpu-node-1.28 AMI, the NVIDIA driver is not installed properly.

With the aws-eks-gpu-node-1.25 AMI, the scripts in /etc/eks run as expected and the NVIDIA driver is installed.

On a node launched from the aws-eks-gpu-node-1.28 AMI, it looks like this:

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
$ journalctl -u configure-nvidia.service
~~
 Apr 04 11:12:59 localhost systemd[1]: Starting Configure NVIDIA instance types...
 Apr 04 11:12:59 localhost configure-nvidia.sh[2177]: + gpu-ami-util has-nvidia-devices
 Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: true
 Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: + /etc/eks/nvidia-kmod-load.sh
 Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: true
 Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: curl: (7) Failed to connect to 169.254.169.254 port 80 after 0
 Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: curl: (7) Failed to connect to 169.254.169.254 port 80 after 0
 Apr 04 11:13:00 localhost systemd[1]: configure-nvidia.service: main process exited, code=exited, status=1/FAILURE
 Apr 04 11:13:00 localhost systemd[1]: Failed to start Configure NVIDIA instance types.
 Apr 04 11:13:00 localhost systemd[1]: Unit configure-nvidia.service entered failed state.
 Apr 04 11:13:00 localhost systemd[1]: configure-nvidia.service failed.
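
To confirm that the unit really starts before the network is up, the declared dependencies and boot ordering can be inspected on the affected node. This is only a diagnostic sketch; the exact output depends on the boot in question.

$ systemctl show configure-nvidia.service -p After -p Wants
$ systemd-analyze critical-chain configure-nvidia.service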

The systemd unit shipped in the aws-eks-gpu-node-1.28 AMI is as follows.

$ cat /etc/systemd/system/configure-nvidia.service
[Unit]
Description=Configure NVIDIA instance types
Before=docker.service containerd.service nvidia-fabricmanager.service nvidia-persistenced.service

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/etc/eks/configure-nvidia.sh

[Install]
WantedBy=multi-user.target docker.service containerd.service


The unit shipped in the aws-eks-gpu-node-1.25 AMI is as follows.

[Unit]
Description=Configure NVIDIA instance types
# the script needs to use IMDS, so wait for the network to be up
# to avoid any flakiness due to races
After=network-online.target
Wants=network-online.target
Before=docker.service containerd.service nvidia-fabricmanager.service nvidia-persistenced.service

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/etc/eks/configure-nvidia.sh

[Install]
WantedBy=multi-user.target docker.service containerd.service

The difference between the two is that the 1.28 unit no longer declares After=network-online.target / Wants=network-online.target.
The NVIDIA driver installation then fails because the script's query to IMDS at 169.254.169.254:80 is made before the network is up.
I wonder if removing this dependency was intentional.
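
Until a fixed AMI is available, one possible workaround (untested here, shown only as a sketch) is to restore the ordering with a systemd drop-in via user data or a shell on the node, rather than editing the shipped unit:

$ sudo mkdir -p /etc/systemd/system/configure-nvidia.service.d
$ sudo tee /etc/systemd/system/configure-nvidia.service.d/10-wait-for-network.conf <<'EOF'
[Unit]
# configure-nvidia.sh queries IMDS, so wait for the network to be up,
# matching what the 1.25 unit declared
After=network-online.target
Wants=network-online.target
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl restart configure-nvidia.service

On an already-booted node the restart runs the script with the network up, and the drop-in keeps the ordering on subsequent boots.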

Environment:

  • AWS Region: ap-northeast-2
  • Instance Type(s): g4dn.xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.11
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): v1.25 and v1.28
  • AMI Version: aws-eks-gpu-node-1.25 / aws-eks-gpu-node-1.28
  • Kernel (e.g. uname -a):
    • Linux ip-172-31-13-206.ap-northeast-2.compute.internal 5.10.210-201.855.amzn2.x86_64 #1 SMP Tue Mar 12 19:03:26 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
    • BASE_AMI_ID="ami-09bffa74b1e396075"
      BUILD_TIME="Fri Feb 17 21:58:10 UTC 2023"
      BUILD_KERNEL="5.10.165-143.735.amzn2.x86_64"
      ARCH="x86_64"

This is fixed in the latest release 👍