utkuozdemir/nvidia_gpu_exporter

Failed to initialize NVML: Unknown Error

jangrewe opened this issue · 4 comments

Describe the bug
I'm running the current version of your Docker image, and it works most of the time, but sometimes it starts to fail and I need to restart the container.
Sometimes it runs for a whole day, sometimes only for a couple of minutes.

To Reproduce
Steps to reproduce the behavior:

  1. Systemd Unit ExecStart:
/usr/bin/docker run --name prometheus-nvidia-gpu-exporter \
  --gpus all \
  -p 9835:9835 \
  -v /dev/nvidiactl:/dev/nvidiactl \
  -v /dev/nvidia0:/dev/nvidia0 \
  -v /usr/lib/x86_64-linux-gnu/libnvidia-ml.so:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so \
  -v /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 \
  -v /usr/bin/nvidia-smi:/usr/bin/nvidia-smi \
  utkuozdemir/nvidia_gpu_exporter:1.2.0

Expected behavior
I'd expect the exporter to keep working and not start throwing errors. ;-)

Console output
(Disregard the mismatched timestamps; I copy-pasted the error first and then added the initial log from starting the container.)

May 24 19:01:22 hades systemd[1]: Stopped Prometheus Nvidia GPU Exporter.
May 24 19:01:22 hades systemd[1]: Starting Prometheus Nvidia GPU Exporter...
May 24 19:01:22 hades docker[1915038]: prometheus-nvidia-gpu-exporter
May 24 19:01:23 hades docker[1915048]: 1.2.0: Pulling from utkuozdemir/nvidia_gpu_exporter
May 24 19:01:23 hades docker[1915048]: Digest: sha256:cc407f77ab017101ce233a0185875ebc75d2a0911381741b20ad91f695e488c7
May 24 19:01:23 hades docker[1915048]: Status: Image is up to date for utkuozdemir/nvidia_gpu_exporter:1.2.0
May 24 19:01:23 hades docker[1915048]: docker.io/utkuozdemir/nvidia_gpu_exporter:1.2.0
May 24 19:01:23 hades systemd[1]: Started Prometheus Nvidia GPU Exporter.
May 24 19:01:24 hades docker[1915066]: ts=2023-05-24T17:01:24.380Z caller=tls_config.go:232 level=info msg="Listening on" address=[::]:9835
May 24 19:01:24 hades docker[1915066]: ts=2023-05-24T17:01:24.380Z caller=tls_config.go:235 level=info msg="TLS is disabled." http2=false address=[::]:9835
[...]
May 24 19:00:45 hades docker[1903720]: ts=2023-05-24T17:00:45.428Z caller=exporter.go:184 level=error error="error running command: exit status 255: command failed. code: 255 | command: nvidia-smi --query-gpu=timestamp,driver_version,vgpu_driver_capability.heterogenous_multivGPU,count,name,serial,uuid,pci.bus_id,pci.domain,pci.bus,pci.device,pci.device_id,pci.sub_device_id,vgpu_device_capability.fractional_multiVgpu,vgpu_device_capability.heterogeneous_timeSlice_profile,vgpu_device_capability.heterogeneous_timeSlice_sizes,pcie.link.gen.current,pcie.link.gen.gpucurrent,pcie.link.gen.max,pcie.link.gen.gpumax,pcie.link.gen.hostmax,pcie.link.width.current,pcie.link.width.max,index,display_mode,display_active,persistence_mode,accounting.mode,accounting.buffer_size,driver_model.current,driver_model.pending,vbios_version,inforom.img,inforom.oem,inforom.ecc,inforom.pwr,gom.current,gom.pending,fan.speed,pstate,clocks_throttle_reasons.supported,clocks_throttle_reasons.active,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,clocks_throttle_reasons.sync_boost,memory.total,memory.reserved,memory.used,memory.free,compute_mode,compute_cap,utilization.gpu,utilization.memory,encoder.stats.sessionCount,encoder.stats.averageFps,encoder.stats.averageLatency,ecc.mode.current,ecc.mode.pending,ecc.errors.corrected.volatile.device_memory,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,ecc.errors.corrected.volatile.l1_cache,ecc.errors.corrected.volatile.l2_cache,ecc.errors.corrected.volatile.texture_memory,ecc.errors.corrected.volatile.cbu,ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total,ecc.errors.corrected.aggregate.device_memory,ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.register_file,ecc.errors.corrected.aggregate.l1_cache,ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.corrected.aggregate.cbu,ecc.errors.corrected.aggregate.sram,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.device_memory,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l1_cache,ecc.errors.uncorrected.volatile.l2_cache,ecc.errors.uncorrected.volatile.texture_memory,ecc.errors.uncorrected.volatile.cbu,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.device_memory,ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.l1_cache,ecc.errors.uncorrected.aggregate.l2_cache,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.cbu,ecc.errors.uncorrected.aggregate.sram,ecc.errors.uncorrected.aggregate.total,retired_pages.single_bit_ecc.count,retired_pages.double_bit.count,retired_pages.pending,temperature.gpu,temperature.memory,power.management,power.draw,power.draw.average,power.draw.instant,power.limit,enforced.power.limit,power.default_limit,power.min_limit,power.max_limit,clocks.current.graphics,clocks.current.sm,clocks.current.memory,clocks.current.video,clocks.applications.graphics,clocks.applications.memory,clocks.default_applications.graphics,clocks.default_applications.memory,clocks.max.graphics,clocks.max.sm,clocks.max.memory,mig.mode.current,mig.
mode.pending,fabric.state,fabric.status --format=csv | stdout: Failed to initialize NVML: Unknown Error\n | stderr: "

(The error from the title is at the end of this very long last line.)

Model and Version

  • GPU Model: RTX 4070 Ti
  • App version: 1.2.0 amd64
  • Installation method: Docker image
  • Operating System: Debian 11/bullseye
  • Nvidia GPU driver version: 525.116.04 (see the nvidia-smi output below)

Running on Docker with Nvidia Container Toolkit:

$ docker info
Client: Docker Engine - Community
 Version:    24.0.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.4
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx

Server:
 Containers: 84
  Running: 83
  Paused: 0
  Stopped: 1
 Images: 87
 Server Version: 24.0.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
 runc version: v1.1.7-0-g860f061
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.10.0-23-amd64
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 125.7GiB
 Docker Root Dir: /srv/docker
 Debug Mode: false
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: true
$ dpkg -l | grep nvidia
ii  libnvidia-container-tools             1.13.1-1                                                                   amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64            1.13.1-1                                                                   amd64        NVIDIA container runtime library
ii  nvidia-container-toolkit              1.13.1-1                                                                   amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base         1.13.1-1                                                                   amd64        NVIDIA Container Toolkit Base
$ nvidia-smi
Wed May 24 19:10:49 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:42:00.0 Off |                  N/A |
|  0%   56C    P2    34W / 285W |   5122MiB / 12282MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    937698      C   /usr/bin/zmc                      225MiB |
|    0   N/A  N/A   3332933      C   python3                          1838MiB |
|    0   N/A  N/A   3469008      C   python                           3056MiB |
+-----------------------------------------------------------------------------+

It seems the error on stdout is

Failed to initialize NVML: Unknown Error

When I Google the error, I find very similar issues, such as:

Can you have a look at them? I don't think this is an issue with the exporter itself, because the exporter is just a dumb tool that runs the nvidia-smi command each time it is probed.
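
A quick way to confirm that is to run a similar query by hand inside the already-running container (a sketch using the container name from the ExecStart above; the query fields are just examples). If this also prints the NVML error, the problem is the container's GPU access, not the exporter:

# run the same kind of nvidia-smi query the exporter uses, inside the container
$ docker exec prometheus-nvidia-gpu-exporter \
    nvidia-smi --query-gpu=timestamp,name,driver_version --format=csv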

nicklausbrown commented

I've tried getting a number of NVIDIA tools working on Docker before, and I think I see something in your docker info output that could be the problem, @jangrewe: while you have the nvidia runtime installed, it is not your default runtime. Perhaps that is the issue?

Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: runc
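
If you want to go the default-runtime route, a minimal /etc/docker/daemon.json sketch (assuming the NVIDIA Container Toolkit has already registered the nvidia runtime, as your docker info shows) would look roughly like this, followed by a Docker restart:

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

$ sudo systemctl restart docker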

jangrewe commented

Thanks @nicklausbrown, I'll try running with --runtime nvidia --privileged to see if that fixes the intermittent errors; maybe using the proper runtime keeps nvidia-smi and/or the exporter from tripping up. 🙂
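
For reference, the adjusted ExecStart I have in mind looks roughly like this (just a sketch, not tested yet; with --runtime nvidia the toolkit should inject the driver libraries and device nodes itself, so the explicit -v mounts from the original unit may no longer be needed):

/usr/bin/docker run --name prometheus-nvidia-gpu-exporter \
  --runtime nvidia \
  --gpus all \
  --privileged \
  -p 9835:9835 \
  utkuozdemir/nvidia_gpu_exporter:1.2.0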

y3ti commented

Here is a very good explanation of this issue: NVIDIA/nvidia-docker#1730
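
As I understand that thread, the gist is that with cgroup v2 and the systemd cgroup driver, a container can lose access to the GPU device nodes (for example after a systemd daemon-reload), which then shows up as "Failed to initialize NVML: Unknown Error". One workaround mentioned there (sketch only, please double-check against the linked issue) is to pass the device nodes with --device instead of -v, so Docker sets up the device cgroup rules itself:

# hypothetical variant of the original run command using --device
$ docker run --name prometheus-nvidia-gpu-exporter \
    --runtime nvidia \
    --gpus all \
    -p 9835:9835 \
    --device /dev/nvidiactl \
    --device /dev/nvidia0 \
    --device /dev/nvidia-uvm \
    --device /dev/nvidia-uvm-tools \
    utkuozdemir/nvidia_gpu_exporter:1.2.0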