[Bug]: Nvidia GPU does not show in metrics
danielkinahan opened this issue · 2 comments
danielkinahan commented
Bug description
GPU doesn't show as an available metric on my node.
Expected behavior
I'd like to see these metrics.
Steps to reproduce
- Install nvidia drivers and nvidia container toolkit
- docker compose up -d
services:
netdata:
image: netdata/netdata
container_name: netdata
hostname: netdata.mydomain.local
pid: host
restart: unless-stopped
cap_add:
- SYS_PTRACE
- SYS_ADMIN
security_opt:
- apparmor:unconfined
volumes:
- ./netdataconfig/netdata:/etc/netdata
- lib:/var/lib/netdata
- cache:/var/cache/netdata
- /etc/passwd:/host/etc/passwd:ro
- /etc/group:/host/etc/group:ro
- /etc/localtime:/etc/localtime:ro
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /etc/os-release:/host/etc/os-release:ro
- /var/log:/host/var/log:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- /run/dbus:/run/dbus:ro
networks:
caddy:
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [ gpu ]
volumes:
lib:
cache:
networks:
caddy:
name: caddy
external: true
Installation method
docker
System info
$ uname -a; grep -HvE "^#|URL" /etc/*release
Linux server 6.1.0-20-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.85-1 (2024-04-11) x86_64 GNU/Linux
/etc/os-release:PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
/etc/os-release:NAME="Debian GNU/Linux"
/etc/os-release:VERSION_ID="12"
/etc/os-release:VERSION="12 (bookworm)"
/etc/os-release:VERSION_CODENAME=bookworm
/etc/os-release:ID=debian
$ docker exec -it netdata uname -a; grep -HvE "^#|URL" /etc/*release
Linux netdata.mydomain.local 6.1.0-20-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.85-1 (2024-04-11) x86_64 GNU/Linux
/etc/os-release:PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
/etc/os-release:NAME="Debian GNU/Linux"
/etc/os-release:VERSION_ID="12"
/etc/os-release:VERSION="12 (bookworm)"
/etc/os-release:VERSION_CODENAME=bookworm
/etc/os-release:ID=debian
Netdata build info
$ docker exec -it netdata netdata -W buildinfo
Packaging:
Netdata Version ____________________________________________ : v1.45.0-236-nightly
Installation Type __________________________________________ : oci
Package Architecture _______________________________________ : x86_64
Package Distro _____________________________________________ : unknown
Configure Options __________________________________________ : dummy-configure-command
Default Directories:
User Configurations ________________________________________ : /etc/netdata
Stock Configurations _______________________________________ : /usr/lib/netdata/conf.d
Ephemeral Databases (metrics data, metadata) _______________ : /var/cache/netdata
Permanent Databases ________________________________________ : /var/lib/netdata
Plugins ____________________________________________________ : /usr/libexec/netdata/plugins.d
Static Web Files ___________________________________________ : /usr/share/netdata/web
Log Files __________________________________________________ : /var/log/netdata
Lock Files _________________________________________________ : /var/lib/netdata/lock
Home _______________________________________________________ : /var/lib/netdata
Operating System:
Kernel _____________________________________________________ : Linux
Kernel Version _____________________________________________ : 6.1.0-20-amd64
Operating System ___________________________________________ : Debian GNU/Linux
Operating System ID ________________________________________ : debian
Operating System ID Like ___________________________________ : unknown
Operating System Version ___________________________________ : 12 (bookworm)
Operating System Version ID ________________________________ : 12
Detection __________________________________________________ : /host/etc/os-release
Hardware:
CPU Cores __________________________________________________ : 8
CPU Frequency ______________________________________________ : 3600000000
RAM Bytes __________________________________________________ : 33567473664
Disk Capacity ______________________________________________ : 5001002754048
CPU Architecture ___________________________________________ : x86_64
Virtualization Technology __________________________________ : none
Virtualization Detection ___________________________________ : none
Container:
Container __________________________________________________ : docker
Container Detection ________________________________________ : dockerenv
Container Orchestrator _____________________________________ : none
Container Operating System _________________________________ : Debian GNU/Linux
Container Operating System ID ______________________________ : debian
Container Operating System ID Like _________________________ : unknown
Container Operating System Version _________________________ : 12 (bookworm)
Container Operating System Version ID ______________________ : 12
Container Operating System Detection _______________________ : /etc/os-release
Features:
Built For __________________________________________________ : Linux
Netdata Cloud ______________________________________________ : YES
Health (trigger alerts and send notifications) _____________ : YES
Streaming (stream metrics to parent Netdata servers) _______ : YES
Back-filling (of higher database tiers) ____________________ : YES
Replication (fill the gaps of parent Netdata servers) ______ : YES
Streaming and Replication Compression ______________________ : YES (zstd lz4 gzip)
Contexts (index all active and archived metrics) ___________ : YES
Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
Machine Learning ___________________________________________ : YES
Database Engines:
dbengine (compression) _____________________________________ : YES (zstd lz4)
alloc ______________________________________________________ : YES
ram ________________________________________________________ : YES
none _______________________________________________________ : YES
Connectivity Capabilities:
ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : YES
static (Netdata internal web server) _______________________ : YES
h2o (web server) ___________________________________________ : YES
WebRTC (experimental) ______________________________________ : NO
Native HTTPS (TLS Support) _________________________________ : YES
TLS Host Verification ______________________________________ : YES
Libraries:
LZ4 (extremely fast lossless compression algorithm) ________ : YES
ZSTD (fast, lossless compression algorithm) ________________ : YES
zlib (lossless data-compression library) ___________________ : YES
Brotli (generic-purpose lossless compression algorithm) ____ : NO
protobuf (platform-neutral data serialization protocol) ____ : YES (system)
OpenSSL (cryptography) _____________________________________ : YES
libdatachannel (stand-alone WebRTC data channels) __________ : NO
JSON-C (lightweight JSON manipulation) _____________________ : YES
libcap (Linux capabilities system operations) ______________ : NO
libcrypto (cryptographic functions) ________________________ : YES
libyaml (library for parsing and emitting YAML) ____________ : YES
Plugins:
apps (monitor processes) ___________________________________ : YES
cgroups (monitor containers and VMs) _______________________ : YES
cgroup-network (associate interfaces to CGROUPS) ___________ : YES
proc (monitor Linux systems) _______________________________ : YES
tc (monitor Linux network QoS) _____________________________ : YES
diskspace (monitor Linux mount points) _____________________ : YES
freebsd (monitor FreeBSD systems) __________________________ : NO
macos (monitor MacOS systems) ______________________________ : NO
statsd (collect custom application metrics) ________________ : YES
timex (check system clock synchronization) _________________ : YES
idlejitter (check system latency and jitter) _______________ : YES
bash (support shell data collection jobs - charts.d) _______ : YES
debugfs (kernel debugging metrics) _________________________ : YES
cups (monitor printers and print jobs) _____________________ : NO
ebpf (monitor system calls) ________________________________ : NO
freeipmi (monitor enterprise server H/W) ___________________ : YES
nfacct (gather netfilter accounting) _______________________ : NO
perf (collect kernel performance events) ___________________ : YES
slabinfo (monitor kernel object caching) ___________________ : YES
Xen ________________________________________________________ : NO
Xen VBD Error Tracking _____________________________________ : NO
Logs Management ____________________________________________ : YES
Exporters:
AWS Kinesis ________________________________________________ : NO
GCP PubSub _________________________________________________ : NO
MongoDB ____________________________________________________ : YES
Prometheus (OpenMetrics) Exporter __________________________ : YES
Prometheus Remote Write ____________________________________ : YES
Graphite ___________________________________________________ : YES
Graphite HTTP / HTTPS ______________________________________ : YES
JSON _______________________________________________________ : YES
JSON HTTP / HTTPS __________________________________________ : YES
OpenTSDB ___________________________________________________ : YES
OpenTSDB HTTP / HTTPS ______________________________________ : YES
All Metrics API ____________________________________________ : YES
Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
Trace All Netdata Allocations (with charts) ________________ : NO
Developer Mode (more runtime checks, slower) _______________ : NO
Additional info
The container toolkit is configured correctly and the plugin been configured.
$ docker exec -iu netdata -t netdata /usr/libexec/netdata/plugins.d/go.d.plugin -d -m nvidia-smi
DBG godplugin/main.go:144 plugin: name=go.d, version=1.45.0.236 component=agent
DBG godplugin/main.go:146 current user: name=netdata, uid=201 component=agent
INF godplugin/main.go:150 env HTTP_PROXY '', HTTPS_PROXY '' component=agent
INF agent/agent.go:145 instance is started component=agent
INF agent/setup.go:23 loading config file component=agent
DBG agent/setup.go:31 looking for 'go.d.conf' in [/etc/netdata /usr/lib/netdata/conf.d] component=agent
INF agent/setup.go:38 found /usr/lib/netdata/conf.d/go.d.conf component=agent
INF agent/setup.go:45 config successfully loaded component=agent
INF agent/agent.go:149 using config: enabled 'true', default_run 'true', max_procs '0' component=agent
INF agent/setup.go:50 loading modules component=agent
INF agent/setup.go:73 enabled/registered modules: 0/83 component=agent
INF agent/agent.go:162 no modules to run component=agent
$ docker exec -it netdata nvidia-smi
Tue Apr 23 11:05:38 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P400 On | 00000000:02:00.0 Off | N/A |
| 34% 30C P8 N/A / 30W | 1MiB / 2048MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
ilyam8 commented
Hi. It is nvidia_smi and you need to enable it in go.d.conf. Check nvidia smi collector documentation.
danielkinahan commented
That fixed it thanks!