Metrics export broken due to device naming mismatch
igorwwwwwwwwwwwwwwwwwwww commented
We're getting this error repeatedly in the logs, and no GPU metrics are being exported. This is on GKE:
GPU utilization for device GPU-4f8b874c-22da-69ad-2516-32c3a568d707, nvml return code: 3. Skipping this device
E0518 15:09:02.557651 1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found
I suspect it may be an issue with device names (nvidia0 vs nvidia0/gi0), but I'm not entirely sure.
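Roughly what I think is happening, as a minimal sketch (the map and names below are made up for illustration, not taken from the plugin source): the metrics collector seems to key its device map by the plain device name it discovers via NVML ("Found device nvidia0 for metrics collection"), while ListAndWatch advertises the MIG partition ID ("ID:nvidia0/gi0"), so the lookup by partition ID fails.

```go
package main

import "fmt"

func main() {
	// Devices the metrics collector found (keyed by plain name).
	metricsDevices := map[string]bool{"nvidia0": true}

	// Device ID advertised by the device plugin via ListAndWatch.
	pluginDeviceID := "nvidia0/gi0"

	// Looking up the partition ID in the plain-name map fails,
	// matching the "device nvidia0/gi0 not found" error in the logs.
	if !metricsDevices[pluginDeviceID] {
		fmt.Printf("Failed to get device for %s: device %s not found\n",
			pluginDeviceID, pluginDeviceID)
	}
}
```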
Full logs:
➜ ~ k -n kube-system logs nvidia-gpu-device-plugin-4rcj2
I0518 15:07:21.825159 1 nvidia_gpu.go:75] device-plugin started
I0518 15:07:21.825230 1 nvidia_gpu.go:82] Reading GPU config file: /etc/nvidia/gpu_config.json
I0518 15:07:21.825362 1 nvidia_gpu.go:91] Using gpu config: {7g.40gb 0 { 0} []}
E0518 15:07:27.545855 1 nvidia_gpu.go:117] failed to start GPU device manager: failed to start mig device manager: Number of partitions (0) for GPU 0 does not match expected partition count (1)
I0518 15:07:32.547969 1 mig.go:175] Discovered GPU partition: nvidia0/gi0
I0518 15:07:32.549461 1 nvidia_gpu.go:122] Starting metrics server on port: 2112, endpoint path: /metrics, collection frequency: 30000
I0518 15:07:32.550354 1 metrics.go:134] Starting metrics server
I0518 15:07:32.550430 1 metrics.go:140] nvml initialized successfully. Driver version: 470.161.03
I0518 15:07:32.550446 1 devices.go:115] Found 1 GPU devices
I0518 15:07:32.556369 1 devices.go:126] Found device nvidia0 for metrics collection
I0518 15:07:32.556430 1 health_checker.go:65] Starting GPU Health Checker
I0518 15:07:32.556440 1 health_checker.go:68] Healthchecker receives device nvidia0/gi0, device {nvidia0/gi0 Healthy nil {} 0}+
I0518 15:07:32.556475 1 health_checker.go:77] Found 1 GPU devices
I0518 15:07:32.556667 1 health_checker.go:145] HealthChecker detects MIG is enabled on device nvidia0
I0518 15:07:32.560599 1 health_checker.go:164] Found mig device nvidia0/gi0 for health monitoring. UUID: MIG-7f562a1e-1c4c-5334-aff0-c679f8b6bc29
I0518 15:07:32.561030 1 health_checker.go:113] Registering device /dev/nvidia0. UUID: MIG-7f562a1e-1c4c-5334-aff0-c679f8b6bc29
I0518 15:07:32.561195 1 manager.go:385] will use alpha API
I0518 15:07:32.561206 1 manager.go:399] starting device-plugin server at: /device-plugin/nvidiaGPU-1684422452.sock
I0518 15:07:32.561393 1 manager.go:426] device-plugin server started serving
I0518 15:07:32.564986 1 beta_plugin.go:40] device-plugin: ListAndWatch start
I0518 15:07:32.565040 1 manager.go:434] device-plugin registered with the kubelet
I0518 15:07:32.565003 1 beta_plugin.go:138] ListAndWatch: send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:nvidia0/gi0,Health:Healthy,Topology:nil,},},}
E0518 15:08:02.557919 1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found
I0518 15:08:02.643081 1 metrics.go:217] Error calculating duty cycle for device: nvidia0: Failed to get dutyCycle: failed to get GPU utilization for device GPU-4f8b874c-22da-69ad-2516-32c3a568d707, nvml return code: 3. Skipping this device
E0518 15:08:32.557732 1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found
I0518 15:08:32.683032 1 metrics.go:217] Error calculating duty cycle for device: nvidia0: Failed to get dutyCycle: failed to get GPU utilization for device GPU-4f8b874c-22da-69ad-2516-32c3a568d707, nvml return code: 3. Skipping this device
E0518 15:09:02.557651 1 metrics.go:200] Failed to get device for nvidia0/gi0: device nvidia0/gi0 not found
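If the naming mismatch really is the cause, I'd guess something along these lines is needed before the metrics lookup: map the MIG partition ID back to its parent device name. This is purely illustrative (the helper name is made up, and it may not be the whole story, since nvml return code 3 looks like NVML_ERROR_NOT_SUPPORTED, which I believe is what utilization queries return on a MIG-enabled GPU).

```go
package main

import (
	"fmt"
	"strings"
)

// parentDeviceName is a hypothetical helper that maps a MIG partition ID
// like "nvidia0/gi0" back to its parent device name ("nvidia0").
// Not taken from the plugin source; shown only to illustrate the idea.
func parentDeviceName(deviceID string) string {
	if i := strings.Index(deviceID, "/"); i >= 0 {
		return deviceID[:i]
	}
	return deviceID
}

func main() {
	fmt.Println(parentDeviceName("nvidia0/gi0")) // nvidia0
	fmt.Println(parentDeviceName("nvidia1"))     // nvidia1
}
```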
Downstream issue: https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/76.