GoogleCloudPlatform/container-engine-accelerators

OutOfnvidia.com/gpu when node is restarted

driosalido opened this issue · 4 comments

While trying to run a sample ReplicaSet that starts pods with a GPU request to test our installation, we discovered that if we restart the node running the pod, the pod enters an OutOfnvidia.com/gpu state that seems to last forever.

Is this the normal behaviour when the resource is lost?

kubectl get pods -o wide | grep replicaset
gpu-replicaset-bw4gf                    1/1     Running               0          2h    10.2.2.9      controller-1.k8s.ml.prod.srcd.host   <none>
gpu-replicaset-bxbc2                    0/1     OutOfnvidia.com/gpu   0          2h    <none>        controller-2.k8s.ml.prod.srcd.host   <none>
gpu-replicaset-n8srl                    1/1     Running               0          2h    10.2.0.9      controller-3.k8s.ml.prod.srcd.host   <none>
gpu-replicaset-spvwb                    1/1     Running               0          2h    10.2.1.6      controller-2.k8s.ml.prod.srcd.host   <none>
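For reference, a manual workaround we can use in the meantime is sketched below (it assumes the pod and node names from the output above; OutOfnvidia.com/gpu pods should show up in the Failed phase). Deleting the stuck pod lets the ReplicaSet controller create a replacement, which the scheduler can place once a node advertises nvidia.com/gpu again:

    # List stuck pods; OutOfnvidia.com/gpu pods land in the Failed phase
    kubectl get pods --field-selector=status.phase=Failed | grep gpu-replicaset

    # Delete the stuck pod; the ReplicaSet creates a fresh one for the scheduler
    kubectl delete pod gpu-replicaset-bxbc2

    # Check that the restarted node is advertising the GPU resource again
    kubectl describe node controller-2.k8s.ml.prod.srcd.host | grep nvidia.com/gpu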

I have run into the same problem on GCE, but with https://github.com/NVIDIA/k8s-device-plugin, after "Instance terminated during maintenance operation" appeared on the https://console.cloud.google.com/compute/operations page.

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=GPU-3187374f-4d6c-93e6-9c79-c5567b3d9202 --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 --pid=5230 /var/lib/docker/overlay2/1319de685e623721c29cb7d6fe75ccca064c2be4b791952a545eab84829d5d83/merged]\\\\nnvidia-container-cli: device error: unknown device id: GPU-3187374f-4d6c-93e6-9c79-c5567b3d9202\\\\n\\\"\"": unknown
      Exit Code:    128
      Started:      Sat, 02 Mar 2019 06:59:44 +0000
      Finished:     Sat, 02 Mar 2019 06:59:44 +0000
    Ready:          False
    Restart Count:  569

It seems that after the node termination the device ID changed (the GPU was replaced during maintenance).
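If that is what happened, one way to confirm and recover is sketched below (the DaemonSet label is an assumption based on the NVIDIA/k8s-device-plugin manifest and may differ in your deployment): compare the UUID in the error message with what the node reports now, then restart the device plugin so it re-registers the new device with the kubelet.

    # List the GPU UUIDs the node currently sees; compare with the UUID in the error above
    nvidia-smi -L

    # Restart the device plugin pod so it re-advertises the new UUID
    # (label assumed from the stock nvidia-device-plugin DaemonSet)
    kubectl delete pod -n kube-system -l name=nvidia-device-plugin-ds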

I am having the same issue now; I am using preemptible VMs with a Tesla T4 GPU attached.

We're having the same issue with preemptible nodes and attached Tesla V100 GPUs.