OutOfnvidia.com/gpu when node is restarted
driosalido opened this issue · 4 comments
While trying to run a sample ReplicaSet that starts pods with a GPU request to test our installation, we discovered that if we restart the node running the pod, the pod enters an OutOfnvidia.com/gpu
state that seems to last forever.
Is this the normal behaviour when the resource is lost?
kubectl get pods -o wide | grep replicaset
gpu-replicaset-bw4gf 1/1 Running 0 2h 10.2.2.9 controller-1.k8s.ml.prod.srcd.host <none>
gpu-replicaset-bxbc2 0/1 OutOfnvidia.com/gpu 0 2h <none> controller-2.k8s.ml.prod.srcd.host <none>
gpu-replicaset-n8srl 1/1 Running 0 2h 10.2.0.9 controller-3.k8s.ml.prod.srcd.host <none>
gpu-replicaset-spvwb 1/1 Running 0 2h 10.2.1.6 controller-2.k8s.ml.prod.srcd.host <none>
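For reference, this is roughly the shape of the ReplicaSet we tested; the names, image, and command below are illustrative placeholders, not our exact manifest:

# Hypothetical ReplicaSet requesting one GPU per pod (names and image are placeholders)
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: gpu-replicaset
spec:
  replicas: 4
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      containers:
      - name: cuda-test
        image: nvidia/cuda:10.0-base              # placeholder image
        command: ["sh", "-c", "nvidia-smi && sleep infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1                     # one GPU per pod
EOF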
cc @jiayingz
I have discovered the same problem on GCE, but with https://github.com/NVIDIA/k8s-device-plugin, after an "Instance terminated during maintenance"
operation appeared on the https://console.cloud.google.com/compute/operations page.
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=GPU-3187374f-4d6c-93e6-9c79-c5567b3d9202 --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 --pid=5230 /var/lib/docker/overlay2/1319de685e623721c29cb7d6fe75ccca064c2be4b791952a545eab84829d5d83/merged]\\\\nnvidia-container-cli: device error: unknown device id: GPU-3187374f-4d6c-93e6-9c79-c5567b3d9202\\\\n\\\"\"": unknown
Exit Code: 128
Started: Sat, 02 Mar 2019 06:59:44 +0000
Finished: Sat, 02 Mar 2019 06:59:44 +0000
Ready: False
Restart Count: 569
It seems that after the node termination the device ID changed (the GPU was replaced during maintenance).
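One way to check this mismatch (a sketch, assuming shell access to the node and that the device plugin runs as the stock DaemonSet in kube-system; the label selector below comes from NVIDIA's example manifest and may differ in your deployment):

# List the GPU UUIDs the driver currently sees on the node
nvidia-smi -L

# Compare with the GPU resources the kubelet/device plugin advertised for that node
kubectl describe node <node-name> | grep nvidia.com/gpu

# Restarting the device plugin pod forces it to re-enumerate the (new) GPU UUIDs
kubectl -n kube-system delete pod -l name=nvidia-device-plugin-ds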
I am having the same issue now; I am using preemptible VMs with a Tesla T4 GPU attached.
We're having the same issue with preemptible nodes and attached Tesla V100 GPUs.