EKS GPU Pod crashes

What happened:
I am trying to bring up a GPU node group and have been following the instructions at
https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html#gpu-ami?icmpid=docs_console_unmapped
to set up a GPU cluster.
What you expected to happen:
The pod should not crash.
How to reproduce it (as minimally and precisely as possible):

  1. Set up a basic GPU cluster with the AL2_x86_64_GPU AMI type and g4dn.xlarge instances.

  2. Deployed the NVIDIA device plugin for Kubernetes as a DaemonSet, using https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/nvidia-device-plugin.yml
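
     For reference, the manifest was applied straight from that URL; a typical invocation (assuming kubectl is already configured for the cluster) is:

         kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/nvidia-device-plugin.yml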

  3. When you run kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu", the GPU is recognized.
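
     Illustrative output (the node name here is hypothetical), showing one allocatable GPU:

         NAME                                          GPU
         ip-192-168-0-1.us-west-2.compute.internal     1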

  4. Deployed a pod with the following manifest, choosing an image that matches the CUDA version (12.2) reported on the VM:

         apiVersion: v1
         kind: Pod
         metadata:
           name: gpu-test-nvidia-smi
         spec:
           restartPolicy: OnFailure
           containers:
             - name: gpu-demo
               image: nvidia/cuda:12.2.2-devel-ubuntu22.04
               command: ['/bin/sh', '-c']
               args: ['nvidia-smi && tail -f /dev/null']
               resources:
                 limits:
                   nvidia.com/gpu: 1
           tolerations:
             - key: 'nvidia.com/gpu'
               operator: 'Equal'
               value: 'true'
               effect: 'NoSchedule'
    
  5. The pod is created, but the logs show /bin/sh: 1: nvidia-smi: not found, and it crashes.
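
     One way to inspect the container despite the crash loop is to swap the command for a sleep and exec in; a sketch reusing the manifest above (gpu-test.yaml is a hypothetical filename for it, with only args changed):

         # In the pod spec above, change only:
         #   args: ['sleep infinity']
         kubectl apply -f gpu-test.yaml
         # Then check whether the NVIDIA runtime mounted the driver tools at all:
         kubectl exec gpu-test-nvidia-smi -- ls -l /usr/bin/nvidia-smi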

  6. Tried with public.ecr.aws/amazonlinux/amazonlinux:2023-minimal as well, with the same result.
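
     It can also help to verify the driver on the node itself, outside Kubernetes; a sketch assuming shell access to the node (e.g. via SSM):

         # Confirm the kernel driver and userspace tools work on the host:
         nvidia-smi
         # See whether an nvidia runtime is configured for containerd:
         grep -i nvidia /etc/containerd/config.toml
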
Anything else we need to know?:

Environment:

  • AWS Region: us-west-2
  • Instance Type(s): g4dn.xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): "eks.6"
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.30
  • AMI Version:
  • Kernel (e.g. uname -a): Linux ip-XXXX.us-west-2.compute.internal 5.10.223-212.873.amzn2.x86_64
  • Release information (run cat /etc/eks/release on a node):
    BASE_AMI_ID="ami-07b6bae8c66d9f05c"
    BUILD_TIME="Wed Aug 28 02:21:41 UTC 2024"
    BUILD_KERNEL="5.10.223-212.873.amzn2.x86_64"
    ARCH="x86_64"
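
The empty AMI Version field above can usually be filled in by resolving the instance's AMI ID to its name; a sketch using the AWS CLI (assumes credentials with ec2:DescribeImages, and that IMDSv1 is reachable on the node; IMDSv2 would need a session token):

    # On the node: the AMI this instance was launched from
    AMI_ID=$(curl -s http://169.254.169.254/latest/meta-data/ami-id)
    # Resolve it to the AMI name, which encodes the EKS AMI version
    aws ec2 describe-images --image-ids "$AMI_ID" --region us-west-2 \
      --query 'Images[0].Name' --output text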

We would need to look at the logs on your node to figure out what's going on. Can you open a case with AWS Support?