EKS GPU Pod crashes
Closed this issue · 1 comment
What happened:
I am trying to bring up a GPU node group and have been following the instructions at
https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html#gpu-ami?icmpid=docs_console_unmapped
to set up a GPU cluster.
What you expected to happen:
The pod should not crash.
How to reproduce it (as minimally and precisely as possible):
- Set up a basic GPU cluster with the AL2_x86_64_GPU AMI type and g4dn.xlarge instances (a CLI sketch of these steps follows the list).
- Deployed the NVIDIA device plugin for Kubernetes as a DaemonSet. I chose https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/nvidia-device-plugin.yml
- When you run `kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"`, the GPU is recognized.
- Tried to deploy a pod with the following manifest, to match CUDA Version 12.2 running on the VM:

  ```yaml
  apiVersion: v1
  kind: Pod
  metadata:
    name: gpu-test-nvidia-smi
  spec:
    restartPolicy: OnFailure
    containers:
      - name: gpu-demo
        image: nvidia/cuda:12.2.2-devel-ubuntu22.04
        command: ['/bin/sh', '-c']
        args: ['nvidia-smi && tail -f /dev/null']
        resources:
          limits:
            nvidia.com/gpu: 1
    tolerations:
      - key: 'nvidia.com/gpu'
        operator: 'Equal'
        value: 'true'
        effect: 'NoSchedule'
  ```
- The pod is created, but the logs show `/bin/sh: 1: nvidia-smi: not found` and it crashes (see the inspection commands after this list).
- Tried with public.ecr.aws/amazonlinux/amazonlinux:2023-minimal as well; same issue.
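
For completeness, here is a minimal sketch of how such a node group and the device plugin can be set up from the command line. This is an illustration only, assuming the AWS CLI and kubectl are configured; the cluster name, node role ARN, and subnet IDs are placeholders rather than values from this cluster.

```sh
# Create a managed node group with the GPU-optimized Amazon Linux 2 AMI type.
# Cluster name, node role ARN, and subnet IDs are placeholders.
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name gpu-nodes \
  --ami-type AL2_x86_64_GPU \
  --instance-types g4dn.xlarge \
  --scaling-config minSize=1,maxSize=1,desiredSize=1 \
  --node-role arn:aws:iam::111122223333:role/eksNodeRole \
  --subnets subnet-aaaa1111 subnet-bbbb2222

# Deploy the NVIDIA device plugin DaemonSet referenced above.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/nvidia-device-plugin.yml

# Check that the GPU shows up as an allocatable resource on the node.
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```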
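
The crash above can be observed with ordinary kubectl commands along these lines (the manifest file name is illustrative):

```sh
# Apply the pod manifest above, then check its status and logs.
kubectl apply -f gpu-test-nvidia-smi.yaml
kubectl get pod gpu-test-nvidia-smi
kubectl logs gpu-test-nvidia-smi        # -> /bin/sh: 1: nvidia-smi: not found
kubectl describe pod gpu-test-nvidia-smi
```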
Anything else we need to know?:
Environment:
- AWS Region: us-west-2
- Instance Type(s): g4dn.xlarge
- EKS Platform version (use `aws eks describe-cluster --name <name> --query cluster.platformVersion`): "eks.6"
- Kubernetes version (use `aws eks describe-cluster --name <name> --query cluster.version`): 1.30
- AMI Version:
- Kernel (e.g. `uname -a`): Linux ip-XXXXus-west-2.compute.internal 5.10.223-212.873.amzn2.x86_64
- Release information (run `cat /etc/eks/release` on a node):

  ```
  BASE_AMI_ID="ami-07b6bae8c66d9f05c"
  BUILD_TIME="Wed Aug 28 02:21:41 UTC 2024"
  BUILD_KERNEL="5.10.223-212.873.amzn2.x86_64"
  ARCH="x86_64"
  ```
We would need to look at the logs on your node to figure out what's going on. Can you open a case with AWS Support?
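
If it helps, a rough sketch of the node-level information that is usually worth collecting for such a case, assuming SSH or SSM access to the node (none of this output is from this issue):

```sh
# On the affected node (via SSH or SSM Session Manager):
cat /etc/eks/release                                   # AMI release info, as quoted above
sudo nvidia-smi                                        # does the driver work on the host?
sudo journalctl -u kubelet --no-pager | tail -n 200
sudo journalctl -u containerd --no-pager | tail -n 200
```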