EKS GPU Pod crashes

What happened:
I am trying to bring up a GPU node group and have been following the instructions at
https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html#gpu-ami?icmpid=docs_console_unmapped
to set up a GPU cluster.
What you expected to happen:
The pod should not crash.
How to reproduce it (as minimally and precisely as possible):

  1. Set up a basic GPU cluster with the AL2_x86_64_GPU AMI type and g4dn.xlarge instances.

  2. Deployed the NVIDIA device plugin for Kubernetes as a DaemonSet, using https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/nvidia-device-plugin.yml
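
     For reference, the manifest was applied straight from that URL; a typical invocation (assuming kubectl is already configured for the cluster) is:

         kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/nvidia-device-plugin.yml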

  3. When you run kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu", the GPU is recognized.
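
     Illustrative output (the node name here is hypothetical), showing one allocatable GPU:

         NAME                                          GPU
         ip-192-168-0-1.us-west-2.compute.internal     1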

  4. Deployed a pod with the following manifest, choosing an image that matches the CUDA version (12.2) reported on the VM:

         apiVersion: v1
         kind: Pod
         metadata:
           name: gpu-test-nvidia-smi
         spec:
           restartPolicy: OnFailure
           containers:
             - name: gpu-demo
               image: nvidia/cuda:12.2.2-devel-ubuntu22.04
               command: ['/bin/sh', '-c']
               args: ['nvidia-smi && tail -f /dev/null']
               resources:
                 limits:
                   nvidia.com/gpu: 1
           tolerations:
             - key: 'nvidia.com/gpu'
               operator: 'Equal'
               value: 'true'
               effect: 'NoSchedule'
    
  5. The pod is created, but the logs show /bin/sh: 1: nvidia-smi: not found, and it crashes.
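
     One way to inspect the container despite the crash loop is to swap the command for a sleep and exec in; a sketch reusing the manifest above (gpu-test.yaml is a hypothetical filename for it, with only args changed):

         # In the pod spec above, change only:
         #   args: ['sleep infinity']
         kubectl apply -f gpu-test.yaml
         # Then check whether the NVIDIA runtime mounted the driver tools at all:
         kubectl exec gpu-test-nvidia-smi -- ls -l /usr/bin/nvidia-smi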

  6. Tried with public.ecr.aws/amazonlinux/amazonlinux:2023-minimal as well, with the same result.
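
     It can also help to verify the driver on the node itself, outside Kubernetes; a sketch assuming shell access to the node (e.g. via SSM):

         # Confirm the kernel driver and userspace tools work on the host:
         nvidia-smi
         # See whether an nvidia runtime is configured for containerd:
         grep -i nvidia /etc/containerd/config.toml
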
Anything else we need to know?:

Environment:

  • AWS Region: us-west-2
  • Instance Type(s): g4dn.xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): "eks.6"
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.30
  • AMI Version:
  • Kernel (e.g. uname -a): Linux ip-XXXX.us-west-2.compute.internal 5.10.223-212.873.amzn2.x86_64
  • Release information (run cat /etc/eks/release on a node):
    BASE_AMI_ID="ami-07b6bae8c66d9f05c"
    BUILD_TIME="Wed Aug 28 02:21:41 UTC 2024"
    BUILD_KERNEL="5.10.223-212.873.amzn2.x86_64"
    ARCH="x86_64"
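
The empty AMI Version field above can usually be filled in by resolving the instance's AMI ID to its name; a sketch using the AWS CLI (assumes credentials with ec2:DescribeImages, and that IMDSv1 is reachable on the node; IMDSv2 would need a session token):

    # On the node: the AMI this instance was launched from
    AMI_ID=$(curl -s http://169.254.169.254/latest/meta-data/ami-id)
    # Resolve it to the AMI name, which encodes the EKS AMI version
    aws ec2 describe-images --image-ids "$AMI_ID" --region us-west-2 \
      --query 'Images[0].Name' --output text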

We would need to look at the logs on your node to figure out what's going on. Can you open a case with AWS Support?