Installation results differ depending on the version of the aws-eks-gpu-node AMI image
Closed · 1 comment
I have an EKS cluster on Kubernetes 1.25 and I am testing an upgrade to 1.28.
When I run a GPU node using the aws-eks-gpu-node-1.28 AMI image, the NVIDIA driver is not installed properly.
With the aws-eks-gpu-node-1.25 AMI image, the scripts in /etc/eks run normally and the NVIDIA driver is installed.
On a node launched from the aws-eks-gpu-node-1.28 AMI image, it looks like this:
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
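As a side note, a generic way to confirm that the kernel modules themselves were never loaded (not specific to this AMI) is:
$ lsmod | grep -i nvidia
# no output means the nvidia kernel modules are not loaded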
$ journalctl -u configure-nvidia.service
~~
Apr 04 11:12:59 localhost systemd[1]: Starting Configure NVIDIA instance types...
Apr 04 11:12:59 localhost configure-nvidia.sh[2177]: + gpu-ami-util has-nvidia-devices
Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: true
Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: + /etc/eks/nvidia-kmod-load.sh
Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: true
Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: curl: (7) Failed to connect to 169.254.169.254 port 80 after 0
Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: curl: (7) Failed to connect to 169.254.169.254 port 80 after 0
Apr 04 11:13:00 localhost systemd[1]: configure-nvidia.service: main process exited, code=exited, status=1/FAILURE
Apr 04 11:13:00 localhost systemd[1]: Failed to start Configure NVIDIA instance types.
Apr 04 11:13:00 localhost systemd[1]: Unit configure-nvidia.service entered failed state.
Apr 04 11:13:00 localhost systemd[1]: configure-nvidia.service failed.
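Once the node has finished booting and the network is up, the same metadata endpoint is reachable. As a manual sanity check I use the standard IMDSv2 token flow (this may differ from whatever nvidia-kmod-load.sh actually does):
$ TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
$ curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-type
# expected to print the instance type, e.g. g4dn.xlarge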
The configure-nvidia.service unit in the aws-eks-gpu-node-1.28 AMI image is as follows.
$ cat /etc/systemd/system/configure-nvidia.service
[Unit]
Description=Configure NVIDIA instance types
Before=docker.service containerd.service nvidia-fabricmanager.service nvidia-persistenced.service
[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/etc/eks/configure-nvidia.sh
[Install]
WantedBy=multi-user.target docker.service containerd.service
The same unit in the aws-eks-gpu-node-1.25 AMI image is as follows.
[Unit]
Description=Configure NVIDIA instance types
# the script needs to use IMDS, so wait for the network to be up
# to avoid any flakiness due to races
After=network-online.target
Wants=network-online.target
Before=docker.service containerd.service nvidia-fabricmanager.service nvidia-persistenced.service
[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/etc/eks/configure-nvidia.sh
[Install]
WantedBy=multi-user.target docker.service containerd.service
The difference between the two is that the 1.28 unit no longer has the After=network-online.target / Wants=network-online.target ordering.
Because /etc/eks/configure-nvidia.sh queries IMDS, the service can start before the network is up; the curl to 169.254.169.254 port 80 then fails and the NVIDIA driver installation does not complete.
I wonder whether this removal was intentional.
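As a stopgap until a fixed AMI is released, I would expect a systemd drop-in that restores the 1.25 ordering to work. This is only a sketch; the drop-in file name is my own choice and I have not verified it on the 1.28 AMI:
$ sudo mkdir -p /etc/systemd/system/configure-nvidia.service.d
$ cat <<'EOF' | sudo tee /etc/systemd/system/configure-nvidia.service.d/10-network-online.conf
[Unit]
# restore the ordering the 1.25 unit had, so IMDS is reachable when the script runs
After=network-online.target
Wants=network-online.target
EOF
$ sudo systemctl daemon-reload
This would need to be applied via user data (or baked into a custom AMI) so that it is in place before configure-nvidia.service runs at first boot.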
Environment:
- AWS Region: ap-northeast-2
- Instance Type(s): g4dn.xlarge
- EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.11
- Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): v1.25 and v1.28
- AMI Version: aws-eks-gpu-node-1.25 / aws-eks-gpu-node-1.28
- Kernel (e.g. uname -a): Linux ip-172-31-13-206.ap-northeast-2.compute.internal 5.10.210-201.855.amzn2.x86_64 #1 SMP Tue Mar 12 19:03:26 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
- Release information (run cat /etc/eks/release on a node):
  BASE_AMI_ID="ami-09bffa74b1e396075"
  BUILD_TIME="Fri Feb 17 21:58:10 UTC 2023"
  BUILD_KERNEL="5.10.165-143.735.amzn2.x86_64"
  ARCH="x86_64"
This is fixed in the latest release 👍