NVIDIA/k8s-device-plugin

"No devices found. Waiting indefinitely" should crash the pod


I'm facing a sporadic issue on Amazon EC2.

The NVIDIA device plugin DaemonSet (image nvcr.io/nvidia/k8s-device-plugin:v0.17.0) silently fails with these logs:

 "Starting NVIDIA Device Plugin" version=<
        d475b2cf
        commit: d475b2cfcf12b983a4975d4fc59d91af432cf28e
 >
I1129 11:18:42.706636       1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I1129 11:18:42.706683       1 main.go:245] Starting OS watcher.
I1129 11:18:42.706950       1 main.go:260] Starting Plugins.
I1129 11:18:42.706983       1 main.go:317] Loading configuration.
I1129 11:18:42.707810       1 main.go:342] Updating config with default resource matching patterns.
I1129 11:18:42.708036       1 main.go:353] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I1129 11:18:42.708051       1 main.go:356] Retrieving plugins.
I1129 11:18:46.704792       1 main.go:381] No devices found. Waiting indefinitely.

From the outside, the pod still appears as 'Running' and produces no failure events. I only realize this is happening when my workloads fail to schedule (e.g. Karpenter's NodeClaims notice that the requested resources are not there).

Monitoring-wise, I could probably set up something to watch the logs and alert us (see the sketch below), but I'd like to challenge the 'Waiting indefinitely' behavior itself. If the container were to return an error to Kubernetes, emit an event, or simply fall into a CrashLoopBackOff cycle, it would be 1. easier to detect and 2. more idiomatic for normal Kubernetes workflows.
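For reference, here is a minimal sketch of the kind of external watchdog I'd have to run today: a small client-go program that lists nodes and flags any that advertise zero nvidia.com/gpu capacity. The label selector and the idea of running this as a cron/sidecar are my own assumptions, not anything the plugin provides.

```go
// Hypothetical watchdog: flags nodes that report no nvidia.com/gpu capacity.
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("building in-cluster config: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("creating clientset: %v", err)
	}

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
		// Assumption: GPU nodes carry this GPU Feature Discovery label;
		// adjust the selector to whatever marks GPU nodes in your cluster.
		LabelSelector: "nvidia.com/gpu.present=true",
	})
	if err != nil {
		log.Fatalf("listing nodes: %v", err)
	}

	for _, node := range nodes.Items {
		gpus, ok := node.Status.Capacity[corev1.ResourceName("nvidia.com/gpu")]
		if !ok || gpus.IsZero() {
			// This is the silent failure mode described above: the device plugin
			// pod is Running, but the node never advertises the GPU resource.
			fmt.Printf("node %s reports no nvidia.com/gpu capacity\n", node.Name)
		}
	}
}
```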

I imagine there's a reason behind the decision to wait, but it would be great to have a retry mechanism and/or a failure event, along the lines of the sketch below.
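To illustrate what I mean, here is a rough sketch of the requested behavior, assuming a hypothetical discoverGPUs() stand-in for the plugin's device discovery (this is not the real k8s-device-plugin code): retry discovery a bounded number of times, then exit non-zero so kubelet restarts the container and the failure surfaces as CrashLoopBackOff instead of a silently 'Running' pod.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"
)

// discoverGPUs is a hypothetical stand-in for the plugin's device discovery
// (e.g. via NVML); it is not the real k8s-device-plugin API.
func discoverGPUs() (int, error) {
	// ... enumerate devices here ...
	return 0, nil
}

// waitForGPUs retries discovery with a fixed delay and gives up after
// maxAttempts instead of waiting indefinitely.
func waitForGPUs(maxAttempts int, delay time.Duration) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		count, err := discoverGPUs()
		if err == nil && count > 0 {
			return count, nil
		}
		log.Printf("attempt %d/%d: no devices found (err=%v); retrying in %s",
			attempt, maxAttempts, err, delay)
		time.Sleep(delay)
	}
	return 0, fmt.Errorf("no devices found after %d attempts", maxAttempts)
}

func main() {
	count, err := waitForGPUs(10, 30*time.Second)
	if err != nil {
		// Exiting non-zero lets kubelet restart the container and surface the
		// problem as CrashLoopBackOff, which is easy to detect and alert on.
		log.Print(err)
		os.Exit(1)
	}
	log.Printf("found %d GPUs, starting plugin", count)
	// ... normal plugin startup would continue here ...
}
```

Even if a crash is considered too disruptive by default, a flag to opt into this behavior (or at least a Kubernetes event on the pod) would make the failure visible to standard tooling.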