Failed to initialize NVML: Unknown Error
jangrewe opened this issue · 4 comments
Describe the bug
I'm running the current version of your Docker image, and it works most of the time, but sometimes it starts to fail and I need to restart the container.
Sometimes it runs for a whole day, and sometimes only for a couple of minutes.
To Reproduce
Steps to reproduce the behavior:
- Systemd Unit ExecStart:
/usr/bin/docker run --name prometheus-nvidia-gpu-exporter \
--gpus all \
-p 9835:9835 \
-v /dev/nvidiactl:/dev/nvidiactl \
-v /dev/nvidia0:/dev/nvidia0 \
-v /usr/lib/x86_64-linux-gnu/libnvidia-ml.so:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so \
-v /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 \
-v /usr/bin/nvidia-smi:/usr/bin/nvidia-smi \
utkuozdemir/nvidia_gpu_exporter:1.2.0
Expected behavior
I'd expect the exporter not to start throwing errors ;-)
Console output
(Disregard the mismatched timestamps: I copied the error first and then added the initial log from starting the container.)
May 24 19:01:22 hades systemd[1]: Stopped Prometheus Nvidia GPU Exporter.
May 24 19:01:22 hades systemd[1]: Starting Prometheus Nvidia GPU Exporter...
May 24 19:01:22 hades docker[1915038]: prometheus-nvidia-gpu-exporter
May 24 19:01:23 hades docker[1915048]: 1.2.0: Pulling from utkuozdemir/nvidia_gpu_exporter
May 24 19:01:23 hades docker[1915048]: Digest: sha256:cc407f77ab017101ce233a0185875ebc75d2a0911381741b20ad91f695e488c7
May 24 19:01:23 hades docker[1915048]: Status: Image is up to date for utkuozdemir/nvidia_gpu_exporter:1.2.0
May 24 19:01:23 hades docker[1915048]: docker.io/utkuozdemir/nvidia_gpu_exporter:1.2.0
May 24 19:01:23 hades systemd[1]: Started Prometheus Nvidia GPU Exporter.
May 24 19:01:24 hades docker[1915066]: ts=2023-05-24T17:01:24.380Z caller=tls_config.go:232 level=info msg="Listening on" address=[::]:9835
May 24 19:01:24 hades docker[1915066]: ts=2023-05-24T17:01:24.380Z caller=tls_config.go:235 level=info msg="TLS is disabled." http2=false address=[::]:9835
[...]
May 24 19:00:45 hades docker[1903720]: ts=2023-05-24T17:00:45.428Z caller=exporter.go:184 level=error error="error running command: exit status 255: command failed. code: 255 | command: nvidia-smi --query-gpu=timestamp,driver_version,vgpu_driver_capability.heterogenous_multivGPU,count,name,serial,uuid,pci.bus_id,pci.domain,pci.bus,pci.device,pci.device_id,pci.sub_device_id,vgpu_device_capability.fractional_multiVgpu,vgpu_device_capability.heterogeneous_timeSlice_profile,vgpu_device_capability.heterogeneous_timeSlice_sizes,pcie.link.gen.current,pcie.link.gen.gpucurrent,pcie.link.gen.max,pcie.link.gen.gpumax,pcie.link.gen.hostmax,pcie.link.width.current,pcie.link.width.max,index,display_mode,display_active,persistence_mode,accounting.mode,accounting.buffer_size,driver_model.current,driver_model.pending,vbios_version,inforom.img,inforom.oem,inforom.ecc,inforom.pwr,gom.current,gom.pending,fan.speed,pstate,clocks_throttle_reasons.supported,clocks_throttle_reasons.active,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,clocks_throttle_reasons.sync_boost,memory.total,memory.reserved,memory.used,memory.free,compute_mode,compute_cap,utilization.gpu,utilization.memory,encoder.stats.sessionCount,encoder.stats.averageFps,encoder.stats.averageLatency,ecc.mode.current,ecc.mode.pending,ecc.errors.corrected.volatile.device_memory,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,ecc.errors.corrected.volatile.l1_cache,ecc.errors.corrected.volatile.l2_cache,ecc.errors.corrected.volatile.texture_memory,ecc.errors.corrected.volatile.cbu,ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total,ecc.errors.corrected.aggregate.device_memory,ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.register_file,ecc.errors.corrected.aggregate.l1_cache,ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.corrected.aggregate.cbu,ecc.errors.corrected.aggregate.sram,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.device_memory,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l1_cache,ecc.errors.uncorrected.volatile.l2_cache,ecc.errors.uncorrected.volatile.texture_memory,ecc.errors.uncorrected.volatile.cbu,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.device_memory,ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.l1_cache,ecc.errors.uncorrected.aggregate.l2_cache,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.cbu,ecc.errors.uncorrected.aggregate.sram,ecc.errors.uncorrected.aggregate.total,retired_pages.single_bit_ecc.count,retired_pages.double_bit.count,retired_pages.pending,temperature.gpu,temperature.memory,power.management,power.draw,power.draw.average,power.draw.instant,power.limit,enforced.power.limit,power.default_limit,power.min_limit,power.max_limit,clocks.current.graphics,clocks.current.sm,clocks.current.memory,clocks.current.video,clocks.applications.graphics,clocks.applications.memory,clocks.default_applications.graphics,clocks.default_applications.memory,clocks.max.graphics,clocks.max.sm,clocks.max.memory,mig.mode.current,mig.
mode.pending,fabric.state,fabric.status --format=csv | stdout: Failed to initialize NVML: Unknown Error\n | stderr: "
(The error from the title is at the end of this very long last line.)
Model and Version
- GPU Model: RTX 4070 Ti
- App version: 1.2.0 amd64
- Installation method: Docker image
- Operating System: Debian 11/bullseye
- Nvidia GPU driver version: 525.116.04 (see the nvidia-smi output below)
Running on Docker with the Nvidia Container Toolkit:
$ docker info
Client: Docker Engine - Community
 Version: 24.0.1
 Context: default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
   Version: v0.10.4
   Path: /usr/libexec/docker/cli-plugins/docker-buildx

Server:
 Containers: 84
  Running: 83
  Paused: 0
  Stopped: 1
 Images: 87
 Server Version: 24.0.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
 runc version: v1.1.7-0-g860f061
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.10.0-23-amd64
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 125.7GiB
 Docker Root Dir: /srv/docker
 Debug Mode: false
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: true
$ dpkg -l | grep nvidia
ii libnvidia-container-tools 1.13.1-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.13.1-1 amd64 NVIDIA container runtime library
ii nvidia-container-toolkit 1.13.1-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.13.1-1 amd64 NVIDIA Container Toolkit Base
$ nvidia-smi
Wed May 24 19:10:49 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04 Driver Version: 525.116.04 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:42:00.0 Off | N/A |
| 0% 56C P2 34W / 285W | 5122MiB / 12282MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 937698 C /usr/bin/zmc 225MiB |
| 0 N/A N/A 3332933 C python3 1838MiB |
| 0 N/A N/A 3469008 C python 3056MiB |
+-----------------------------------------------------------------------------+
It seems the error on stdout is
Failed to initialize NVML: Unknown Error
When I Google the error, I find very similar issues such as:
- NVIDIA/nvidia-docker#1671
- https://bobcares.com/blog/docker-failed-to-initialize-nvml-unknown-error/#:~:text=queries%20and%20issues.-,How%20to%20resolve%20docker%20failed%20to%20initialize%20NVML%20unknown%20error,and%20the%20GPUs%20return%20available.
Can you have a look at them? I don't think this is an issue with the exporter, because the exporter is just a dumb tool that runs the nvidia-smi command each time it is probed.
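One quick way to confirm this outside the exporter (assuming the container is still named prometheus-nvidia-gpu-exporter, as in your unit file) is to run nvidia-smi inside the running container yourself:
docker exec prometheus-nvidia-gpu-exporter nvidia-smi
If that also prints "Failed to initialize NVML: Unknown Error" while the same command works on the host, the container has lost access to the GPU devices and the exporter is only reporting it.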
I've tried getting a number of Nvidia tools working on Docker before, and I think I see something in your docker info output that could be the problem, @jangrewe. Namely, while you have the nvidia runtime installed, it is not your default. Perhaps that is the issue?
Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: runc
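If you want to try making nvidia the default runtime, the usual approach is to set it in /etc/docker/daemon.json and restart the Docker daemon. A minimal sketch (merge it into your existing daemon.json rather than replacing it, since you already have settings like live-restore and a custom data root):
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}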
Thanks @nicklausbrown, I'll try running with --runtime nvidia --privileged to see if that fixes the intermittent errors. Maybe using the proper runtime keeps nvidia-smi and/or the exporter from tripping up. 🙂
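For reference, the ExecStart I'm going to test is the same command as above with those two flags added (untested as of yet):
/usr/bin/docker run --name prometheus-nvidia-gpu-exporter \
--runtime nvidia \
--privileged \
--gpus all \
-p 9835:9835 \
-v /dev/nvidiactl:/dev/nvidiactl \
-v /dev/nvidia0:/dev/nvidia0 \
-v /usr/lib/x86_64-linux-gnu/libnvidia-ml.so:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so \
-v /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 \
-v /usr/bin/nvidia-smi:/usr/bin/nvidia-smi \
utkuozdemir/nvidia_gpu_exporter:1.2.0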
Here is a very good explanation of this issue: NVIDIA/nvidia-docker#1730
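In short, as I understand that thread: on cgroup v2 hosts using the systemd cgroup driver, a systemd daemon-reload/reexec can re-apply the container's device cgroup rules and revoke access to the Nvidia device nodes that were injected behind Docker's back, at which point NVML inside the container fails with "Unknown Error" until the container is restarted. One workaround discussed there is to pass the device nodes to Docker explicitly with --device, so they are part of the container's own device cgroup configuration, instead of plain bind mounts. A sketch, replacing the two -v device mounts above (the exact device list varies per system):
--device /dev/nvidiactl \
--device /dev/nvidia0 \
--device /dev/nvidia-uvm \
--device /dev/nvidia-uvm-tools \
Other options mentioned there include setting no-cgroups = true in /etc/nvidia-container-runtime/config.toml or switching Docker's cgroup driver, but those come with their own trade-offs.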