GoogleCloudPlatform/container-engine-accelerators

Metrics export broken due to device naming mismatch


We're getting this error repeatedly in the logs, and no GPU metrics are being exported. This is on GKE:

failed to get GPU utilization for device GPU-4f8b874c-22da-69ad-2516-32c3a568d707, nvml return code: 3. Skipping this device
E0518 15:09:02.557651       1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found

I suspect it's a mismatch between the device name the metrics code uses and the MIG partition name the plugin advertises (nvidia0 vs nvidia0/gi0), but I'm not entirely sure.
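
To illustrate what I think is happening (just a sketch, not the plugin's actual code; the struct fields and map layout are my assumptions): metrics collection appears to register the physical device as nvidia0 ("Found device nvidia0 for metrics collection"), while ListAndWatch advertises the MIG partition as nvidia0/gi0, so a lookup keyed by the partition ID comes up empty:

```go
package main

import "fmt"

// device is a stand-in for whatever struct the plugin tracks per GPU;
// these fields are assumptions for illustration only.
type device struct {
	name string
	uuid string
}

func main() {
	// Metrics collection seems to key devices by the physical name
	// ("Found device nvidia0 for metrics collection").
	devices := map[string]device{
		"nvidia0": {name: "nvidia0", uuid: "GPU-4f8b874c-22da-69ad-2516-32c3a568d707"},
	}

	// ListAndWatch, however, advertises the MIG partition ID.
	allocated := "nvidia0/gi0"

	if _, ok := devices[allocated]; !ok {
		// Same shape as the error logged every 30s:
		// "Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found"
		fmt.Printf("Failed to get device for %s: device %s not found\n", allocated, allocated)
	}
}
```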

Full logs:

➜  ~ k -n kube-system logs nvidia-gpu-device-plugin-4rcj2
I0518 15:07:21.825159       1 nvidia_gpu.go:75] device-plugin started
I0518 15:07:21.825230       1 nvidia_gpu.go:82] Reading GPU config file: /etc/nvidia/gpu_config.json
I0518 15:07:21.825362       1 nvidia_gpu.go:91] Using gpu config: {7g.40gb 0 { 0} []}
E0518 15:07:27.545855       1 nvidia_gpu.go:117] failed to start GPU device manager: failed to start mig device manager: Number of partitions (0) for GPU 0 does not match expected partition count (1)
I0518 15:07:32.547969       1 mig.go:175] Discovered GPU partition: nvidia0/gi0
I0518 15:07:32.549461       1 nvidia_gpu.go:122] Starting metrics server on port: 2112, endpoint path: /metrics, collection frequency: 30000
I0518 15:07:32.550354       1 metrics.go:134] Starting metrics server
I0518 15:07:32.550430       1 metrics.go:140] nvml initialized successfully. Driver version: 470.161.03
I0518 15:07:32.550446       1 devices.go:115] Found 1 GPU devices
I0518 15:07:32.556369       1 devices.go:126] Found device nvidia0 for metrics collection
I0518 15:07:32.556430       1 health_checker.go:65] Starting GPU Health Checker
I0518 15:07:32.556440       1 health_checker.go:68] Healthchecker receives device nvidia0/gi0, device {nvidia0/gi0 Healthy nil {} 0}+
I0518 15:07:32.556475       1 health_checker.go:77] Found 1 GPU devices
I0518 15:07:32.556667       1 health_checker.go:145] HealthChecker detects MIG is enabled on device nvidia0
I0518 15:07:32.560599       1 health_checker.go:164] Found mig device nvidia0/gi0 for health monitoring. UUID: MIG-7f562a1e-1c4c-5334-aff0-c679f8b6bc29
I0518 15:07:32.561030       1 health_checker.go:113] Registering device /dev/nvidia0. UUID: MIG-7f562a1e-1c4c-5334-aff0-c679f8b6bc29
I0518 15:07:32.561195       1 manager.go:385] will use alpha API
I0518 15:07:32.561206       1 manager.go:399] starting device-plugin server at: /device-plugin/nvidiaGPU-1684422452.sock
I0518 15:07:32.561393       1 manager.go:426] device-plugin server started serving
I0518 15:07:32.564986       1 beta_plugin.go:40] device-plugin: ListAndWatch start
I0518 15:07:32.565040       1 manager.go:434] device-plugin registered with the kubelet
I0518 15:07:32.565003       1 beta_plugin.go:138] ListAndWatch: send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:nvidia0/gi0,Health:Healthy,Topology:nil,},},}
E0518 15:08:02.557919       1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found
I0518 15:08:02.643081       1 metrics.go:217] Error calculating duty cycle for device: nvidia0: Failed to get dutyCycle: failed to get GPU utilization for device GPU-4f8b874c-22da-69ad-2516-32c3a568d707, nvml return code: 3. Skipping this device
E0518 15:08:32.557732       1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found
I0518 15:08:32.683032       1 metrics.go:217] Error calculating duty cycle for device: nvidia0: Failed to get dutyCycle: failed to get GPU utilization for device GPU-4f8b874c-22da-69ad-2516-32c3a568d707, nvml return code: 3. Skipping this device
E0518 15:09:02.557651       1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found
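
If I'm reading the NVML docs right, return code 3 is NVML_ERROR_NOT_SUPPORTED, and utilization queries against the physical GPU aren't supported once MIG is enabled, which would explain the duty-cycle errors above. Here's a minimal check against the physical GPU, sketched with the go-nvml bindings (which may not be the binding this plugin uses; the UUID is just the one from the logs):

```go
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("nvml init failed: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	// UUID of the physical GPU taken from the logs above.
	dev, ret := nvml.DeviceGetHandleByUUID("GPU-4f8b874c-22da-69ad-2516-32c3a568d707")
	if ret != nvml.SUCCESS {
		log.Fatalf("get handle: %v", nvml.ErrorString(ret))
	}

	// On a MIG-enabled GPU I'd expect this to fail with
	// ERROR_NOT_SUPPORTED (numeric value 3), matching the
	// "nvml return code: 3" in the duty-cycle errors.
	_, ret = dev.GetUtilizationRates()
	fmt.Printf("GetUtilizationRates returned %d (%s)\n", ret, nvml.ErrorString(ret))
}
```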

Downstream issue: https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/76.