utkuozdemir/nvidia_gpu_exporter

got errors "couldn't parse number from: [n/a]"

kiroswu opened this issue · 13 comments

Describe the bug
Command executed: # ./nvidia_gpu_exporter --web.listen-address :20127 --nvidia-smi-command="nvidia-smi" --log.level=debug
After refreshing the nvidia-gpu-metrics dashboard in Grafana, the console prints errors and the dashboard shows no data.

To Reproduce
Steps to reproduce the behavior:

  1. Run command './nvidia_gpu_exporter --web.listen-address :20127 --nvidia-smi-command="nvidia-smi" --log.level=debug'
  2. See error

Expected behavior
The dashboard should show metric data normally.

Model and Version
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57 Driver Version: 450.57 CUDA Version: 11.0 |

  • GPU Model [e.g. GeForce RTX 2080 TI]
  • App version and architecture [' linux_x86_64']
  • Operating System [e.g. Ubuntu 18.04]
  • Nvidia GPU driver version [e.g. Linux driver nvidia-driver-450]

root@4d15723e44d8:/home# ./nvidia_gpu_exporter --version
nvidia_gpu_exporter, version 0.4.0 (branch: HEAD, revision: 76d7496)
build user: goreleaser
build date: 2022-02-08T00:42:44Z
go version: go1.17.5
platform: linux/amd64

Could you give a suggestion? Thanks!

Can you please run the following command and share the output here:

nvidia-smi --query-gpu="timestamp,driver_version,count,name,serial,uuid,pci.bus_id,pci.domain,pci.bus,pci.device,pci.device_id,pci.sub_device_id,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max,index,display_mode,display_active,persistence_mode,accounting.mode,accounting.buffer_size,driver_model.current,driver_model.pending,vbios_version,inforom.img,inforom.oem,inforom.ecc,inforom.pwr,gom.current,gom.pending,fan.speed,pstate,clocks_throttle_reasons.supported,clocks_throttle_reasons.active,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,clocks_throttle_reasons.sync_boost,memory.total,memory.used,memory.free,compute_mode,utilization.gpu,utilization.memory,encoder.stats.sessionCount,encoder.stats.averageFps,encoder.stats.averageLatency,ecc.mode.current,ecc.mode.pending,ecc.errors.corrected.volatile.device_memory,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,ecc.errors.corrected.volatile.l1_cache,ecc.errors.corrected.volatile.l2_cache,ecc.errors.corrected.volatile.texture_memory,ecc.errors.corrected.volatile.cbu,ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total,ecc.errors.corrected.aggregate.device_memory,ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.register_file,ecc.errors.corrected.aggregate.l1_cache,ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.corrected.aggregate.cbu,ecc.errors.corrected.aggregate.sram,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.device_memory,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l1_cache,ecc.errors.uncorrected.volatile.l2_cache,ecc.errors.uncorrected.volatile.texture_memory,ecc.errors.uncorrected.volatile.cbu,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.device_memory,ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.l1_cache,ecc.errors.uncorrected.aggregate.l2_cache,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.cbu,ecc.errors.uncorrected.aggregate.sram,ecc.errors.uncorrected.aggregate.total,retired_pages.single_bit_ecc.count,retired_pages.double_bit.count,retired_pages.pending,temperature.gpu,temperature.memory,power.management,power.draw,power.limit,enforced.power.limit,power.default_limit,power.min_limit,power.max_limit,clocks.current.graphics,clocks.current.sm,clocks.current.memory,clocks.current.video,clocks.applications.graphics,clocks.applications.memory,clocks.default_applications.graphics,clocks.default_applications.memory,clocks.max.graphics,clocks.max.sm,clocks.max.memory,mig.mode.current,mig.mode.pending" --format=csv

Also, while the exporter is running, please open http://localhost:9835/metrics in your browser and share the output here.

The messages saying "couldn't parse number from..." are debug-level logs and are expected, so you don't need to worry about them. Only logs with level=error are significant.
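To illustrate why these messages are harmless, here is a minimal Python sketch of a collector that tries to turn each nvidia-smi CSV value into a number and simply skips fields such as [N/A]. It is not the exporter's actual Go code; the names parse_number and collect are hypothetical and only serve the illustration.

```python
import re

# A number, optionally followed by a unit such as "W", "MiB" or "%".
_NUMBER = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*[A-Za-z%]*\s*$")


def parse_number(raw):
    """Return the numeric part of a value like '53.99 W', or None if there is none."""
    match = _NUMBER.match(raw)
    return float(match.group(1)) if match else None


def collect(fields):
    """Keep the fields that parse as numbers; unparsable ones are skipped, not fatal."""
    metrics, skipped = {}, []
    for name, raw in fields.items():
        value = parse_number(raw)
        if value is None:
            # Assumption: this is the point where a debug-level message would be logged.
            skipped.append(name)
        else:
            metrics[name] = value
    return metrics, skipped


# Values taken from the first GPU row of the nvidia-smi output in this thread.
row = {
    "power.draw": "53.99 W",
    "memory.total": "11019 MiB",
    "fan.speed": "30 %",
    "serial": "[N/A]",
}
metrics, skipped = collect(row)
```

In this sketch the three numeric fields end up in metrics, while serial lands in skipped without affecting the others.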

Thanks for the response, @utkuozdemir. The command output is below:
timestamp, driver_version, count, name, serial, uuid, pci.bus_id, pci.domain, pci.bus, pci.device, pci.device_id, pci.sub_device_id, pcie.link.gen.current, pcie.link.gen.max, pcie.link.width.current, pcie.link.width.max, index, display_mode, display_active, persistence_mode, accounting.mode, accounting.buffer_size, driver_model.current, driver_model.pending, vbios_version, inforom.img, inforom.oem, inforom.ecc, inforom.pwr, gom.current, gom.pending, fan.speed [%], pstate, clocks_throttle_reasons.supported, clocks_throttle_reasons.active, clocks_throttle_reasons.gpu_idle, clocks_throttle_reasons.applications_clocks_setting, clocks_throttle_reasons.sw_power_cap, clocks_throttle_reasons.hw_slowdown, clocks_throttle_reasons.hw_thermal_slowdown, clocks_throttle_reasons.hw_power_brake_slowdown, clocks_throttle_reasons.sw_thermal_slowdown, clocks_throttle_reasons.sync_boost, memory.total [MiB], memory.used [MiB], memory.free [MiB], compute_mode, utilization.gpu [%], utilization.memory [%], encoder.stats.sessionCount, encoder.stats.averageFps, encoder.stats.averageLatency, ecc.mode.current, ecc.mode.pending, ecc.errors.corrected.volatile.device_memory, ecc.errors.corrected.volatile.dram, ecc.errors.corrected.volatile.register_file, ecc.errors.corrected.volatile.l1_cache, ecc.errors.corrected.volatile.l2_cache, ecc.errors.corrected.volatile.texture_memory, ecc.errors.corrected.volatile.cbu, ecc.errors.corrected.volatile.sram, ecc.errors.corrected.volatile.total, ecc.errors.corrected.aggregate.device_memory, ecc.errors.corrected.aggregate.dram, ecc.errors.corrected.aggregate.register_file, ecc.errors.corrected.aggregate.l1_cache, ecc.errors.corrected.aggregate.l2_cache, ecc.errors.corrected.aggregate.texture_memory, ecc.errors.corrected.aggregate.cbu, ecc.errors.corrected.aggregate.sram, ecc.errors.corrected.aggregate.total, ecc.errors.uncorrected.volatile.device_memory, ecc.errors.uncorrected.volatile.dram, ecc.errors.uncorrected.volatile.register_file, 
ecc.errors.uncorrected.volatile.l1_cache, ecc.errors.uncorrected.volatile.l2_cache, ecc.errors.uncorrected.volatile.texture_memory, ecc.errors.uncorrected.volatile.cbu, ecc.errors.uncorrected.volatile.sram, ecc.errors.uncorrected.volatile.total, ecc.errors.uncorrected.aggregate.device_memory, ecc.errors.uncorrected.aggregate.dram, ecc.errors.uncorrected.aggregate.register_file, ecc.errors.uncorrected.aggregate.l1_cache, ecc.errors.uncorrected.aggregate.l2_cache, ecc.errors.uncorrected.aggregate.texture_memory, ecc.errors.uncorrected.aggregate.cbu, ecc.errors.uncorrected.aggregate.sram, ecc.errors.uncorrected.aggregate.total, retired_pages.single_bit_ecc.count, retired_pages.double_bit.count, retired_pages.pending, temperature.gpu, temperature.memory, power.management, power.draw [W], power.limit [W], enforced.power.limit [W], power.default_limit [W], power.min_limit [W], power.max_limit [W], clocks.current.graphics [MHz], clocks.current.sm [MHz], clocks.current.memory [MHz], clocks.current.video [MHz], clocks.applications.graphics [MHz], clocks.applications.memory [MHz], clocks.default_applications.graphics [MHz], clocks.default_applications.memory [MHz], clocks.max.graphics [MHz], clocks.max.sm [MHz], clocks.max.memory [MHz], mig.mode.current, mig.mode.pending
2022/04/06 14:56:02.136, 450.57, 8, GeForce RTX 2080 Ti, [N/A], GPU-0e01d9aa-96e8-e62d-bc23-cd9d08891bef, 00000000:04:00.0, 0x0000, 0x04, 0x00, 0x1E0710DE, 0x1E07107D, 3, 3, 16, 16, 0, Disabled, Disabled, Disabled, Disabled, 4000, [N/A], [N/A], 90.02.30.40.7D, G001.0000.02.04, 1.1, [N/A], [N/A], [N/A], [N/A], 30 %, P0, 0x00000000000001FF, 0x0000000000000000, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, 11019 MiB, 0 MiB, 11019 MiB, Default, 0 %, 0 %, 0, 0, 0, [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], 41, N/A, Enabled, 53.99 W, 250.00 W, 250.00 W, 250.00 W, 100.00 W, 280.00 W, 1350 MHz, 1350 MHz, 7000 MHz, 1245 MHz, [N/A], [N/A], [N/A], [N/A], 2100 MHz, 2100 MHz, 7000 MHz, [N/A], [N/A]
2022/04/06 14:56:02.166, 450.57, 8, GeForce RTX 2080 Ti, [N/A], GPU-6905b527-d2b9-8662-b825-fdb6e8ea4302, 00000000:05:00.0, 0x0000, 0x05, 0x00, 0x1E0710DE, 0x1E07107D, 3, 3, 16, 16, 1, Disabled, Disabled, Disabled, Disabled, 4000, [N/A], [N/A], 90.02.30.40.7D, G001.0000.02.04, 1.1, [N/A], [N/A], [N/A], [N/A], 30 %, P0, 0x00000000000001FF, 0x0000000000000000, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, 11019 MiB, 0 MiB, 11019 MiB, Default, 0 %, 0 %, 0, 0, 0, [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], 40, N/A, Enabled, 52.83 W, 250.00 W, 250.00 W, 250.00 W, 100.00 W, 280.00 W, 1350 MHz, 1350 MHz, 7000 MHz, 1245 MHz, [N/A], [N/A], [N/A], [N/A], 2100 MHz, 2100 MHz, 7000 MHz, [N/A], [N/A]
2022/04/06 14:56:02.197, 450.57, 8, GeForce RTX 2080 Ti, [N/A], GPU-f0cec164-0b67-0fa4-dfac-7d2972a67010, 00000000:08:00.0, 0x0000, 0x08, 0x00, 0x1E0710DE, 0x1E07107D, 3, 3, 16, 16, 2, Disabled, Disabled, Disabled, Disabled, 4000, [N/A], [N/A], 90.02.30.40.7D, G001.0000.02.04, 1.1, [N/A], [N/A], [N/A], [N/A], 30 %, P0, 0x00000000000001FF, 0x0000000000000000, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, 11019 MiB, 0 MiB, 11019 MiB, Default, 0 %, 0 %, 0, 0, 0, [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], 41, N/A, Enabled, 54.50 W, 250.00 W, 250.00 W, 250.00 W, 100.00 W, 280.00 W, 1350 MHz, 1350 MHz, 7000 MHz, 1245 MHz, [N/A], [N/A], [N/A], [N/A], 2100 MHz, 2100 MHz, 7000 MHz, [N/A], [N/A]
2022/04/06 14:56:02.229, 450.57, 8, GeForce RTX 2080 Ti, [N/A], GPU-caff9657-88c2-4324-8946-c0a6672b1728, 00000000:09:00.0, 0x0000, 0x09, 0x00, 0x1E0710DE, 0x1E07107D, 3, 3, 16, 16, 3, Disabled, Disabled, Disabled, Disabled, 4000, [N/A], [N/A], 90.02.30.40.7D, G001.0000.02.04, 1.1, [N/A], [N/A], [N/A], [N/A], 29 %, P0, 0x00000000000001FF, 0x0000000000000000, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, 11019 MiB, 0 MiB, 11019 MiB, Default, 0 %, 0 %, 0, 0, 0, [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], 41, N/A, Enabled, 54.97 W, 250.00 W, 250.00 W, 250.00 W, 100.00 W, 280.00 W, 1350 MHz, 1350 MHz, 7000 MHz, 1260 MHz, [N/A], [N/A], [N/A], [N/A], 2100 MHz, 2100 MHz, 7000 MHz, [N/A], [N/A]
2022/04/06 14:56:02.257, 450.57, 8, GeForce RTX 2080 Ti, [N/A], GPU-2227faca-c5d5-a1a1-7791-e66e8514a41b, 00000000:85:00.0, 0x0000, 0x85, 0x00, 0x1E0710DE, 0x1E07107D, 3, 3, 16, 16, 4, Disabled, Disabled, Disabled, Disabled, 4000, [N/A], [N/A], 90.02.30.40.7D, G001.0000.02.04, 1.1, [N/A], [N/A], [N/A], [N/A], 28 %, P0, 0x00000000000001FF, 0x0000000000000000, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, 11019 MiB, 0 MiB, 11019 MiB, Default, 0 %, 0 %, 0, 0, 0, [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], 41, N/A, Enabled, 67.12 W, 250.00 W, 250.00 W, 250.00 W, 100.00 W, 280.00 W, 1350 MHz, 1350 MHz, 7000 MHz, 1245 MHz, [N/A], [N/A], [N/A], [N/A], 2100 MHz, 2100 MHz, 7000 MHz, [N/A], [N/A]
2022/04/06 14:56:02.284, 450.57, 8, GeForce RTX 2080 Ti, [N/A], GPU-fa15f6d5-3e03-f28a-0db3-4cf1ac99afd9, 00000000:86:00.0, 0x0000, 0x86, 0x00, 0x1E0710DE, 0x1E07107D, 3, 3, 16, 16, 5, Disabled, Disabled, Disabled, Disabled, 4000, [N/A], [N/A], 90.02.30.40.7D, G001.0000.02.04, 1.1, [N/A], [N/A], [N/A], [N/A], 33 %, P0, 0x00000000000001FF, 0x0000000000000000, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, 11019 MiB, 0 MiB, 11019 MiB, Default, 1 %, 0 %, 0, 0, 0, [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], 41, N/A, Enabled, 45.59 W, 250.00 W, 250.00 W, 250.00 W, 100.00 W, 280.00 W, 1350 MHz, 1350 MHz, 7000 MHz, 1245 MHz, [N/A], [N/A], [N/A], [N/A], 2100 MHz, 2100 MHz, 7000 MHz, [N/A], [N/A]
2022/04/06 14:56:02.314, 450.57, 8, GeForce RTX 2080 Ti, [N/A], GPU-c9c3d255-b3f4-22d8-1292-25b40fdffa4f, 00000000:89:00.0, 0x0000, 0x89, 0x00, 0x1E0710DE, 0x1E07107D, 3, 3, 16, 16, 6, Disabled, Disabled, Disabled, Disabled, 4000, [N/A], [N/A], 90.02.30.40.7D, G001.0000.02.04, 1.1, [N/A], [N/A], [N/A], [N/A], 34 %, P0, 0x00000000000001FF, 0x0000000000000000, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, 11019 MiB, 0 MiB, 11019 MiB, Default, 1 %, 0 %, 0, 0, 0, [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], 39, N/A, Enabled, 53.65 W, 250.00 W, 250.00 W, 250.00 W, 100.00 W, 280.00 W, 1350 MHz, 1350 MHz, 7000 MHz, 1260 MHz, [N/A], [N/A], [N/A], [N/A], 2100 MHz, 2100 MHz, 7000 MHz, [N/A], [N/A]
2022/04/06 14:56:02.340, 450.57, 8, GeForce RTX 2080 Ti, [N/A], GPU-3c45d99c-fb47-8c66-1109-6033266b8feb, 00000000:8A:00.0, 0x0000, 0x8A, 0x00, 0x1E0710DE, 0x1E07107D, 3, 3, 16, 16, 7, Disabled, Disabled, Disabled, Disabled, 4000, [N/A], [N/A], 90.02.30.40.7D, G001.0000.02.04, 1.1, [N/A], [N/A], [N/A], [N/A], 16 %, P0, 0x00000000000001FF, 0x0000000000000000, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, Not Active, 11019 MiB, 0 MiB, 11019 MiB, Default, 0 %, 0 %, 0, 0, 0, [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], [N/A], 40, N/A, Enabled, 49.79 W, 250.00 W, 250.00 W, 250.00 W, 100.00 W, 280.00 W, 1350 MHz, 1350 MHz, 7000 MHz, 1245 MHz, [N/A], [N/A], [N/A], [N/A], 2100 MHz, 2100 MHz, 7000 MHz, [N/A], [N/A]

The endpoint output is in the attached file:
gpu_metrics.txt

I noticed that the endpoint request took 28 s. Is that normal?

Thanks. I have just released v0.5.0 with more helpful logs. Can you install it and post the log output here?

28 seconds is definitely not normal, but I am not sure whether it is related to the exporter. Please give it a try with 127.0.0.1 instead of localhost: http://127.0.0.1:9835/metrics.
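To put a number on the latency outside the browser, a small script along these lines could time one request to the metrics endpoint. This is only a sketch: time_scrape is a hypothetical helper, and the opener parameter exists solely so the function can be exercised without a running exporter.

```python
import time
import urllib.request


def time_scrape(url, timeout=120.0, opener=urllib.request.urlopen):
    """Fetch url once and return (elapsed_seconds, body_size_in_bytes)."""
    start = time.monotonic()
    with opener(url, timeout=timeout) as response:
        body = response.read()
    return time.monotonic() - start, len(body)


# Example call against a locally running exporter (adjust host and port as needed):
#   elapsed, size = time_scrape("http://127.0.0.1:9835/metrics")
#   print(f"fetched {size} bytes in {elapsed:.1f}s")
```

Running it once on the exporter host and once from a remote machine would show whether the delay comes from the exporter itself or from the network path.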

Tried again with v0.5.0; same issue. Note that I request the URL remotely via the machine's real IP, not via 127.0.0.1 or localhost.
The logs are in the attached files:
command_log.txt
entpoint_response.txt

The request took 1 minute with v0.5.0.

v0.5.0 is not supposed to fix the issue; it doesn't contain any fixes.
It should just log the error in more detail. Can you check the error logs and post them here, please?

See the log file command_log.txt; it looks the same as with the previous v0.4.0.

Ah, OK, I didn't see the attached documents.

Actually, the logs look OK and the endpoint response contains real data. See the power draw and temperature, for example:

nvidia_smi_power_draw_watts{uuid="0e01d9aa-96e8-e62d-bc23-cd9d08891bef"} 54.01
nvidia_smi_power_draw_watts{uuid="2227faca-c5d5-a1a1-7791-e66e8514a41b"} 67.51
...
nvidia_smi_temperature_gpu{uuid="0e01d9aa-96e8-e62d-bc23-cd9d08891bef"} 43
nvidia_smi_temperature_gpu{uuid="2227faca-c5d5-a1a1-7791-e66e8514a41b"} 42
...

This means the exporter is working correctly and the misconfiguration is somewhere in the Prometheus scraping or in the Grafana -> Prometheus connection.
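A quick way to confirm that a saved /metrics response contains usable samples is to parse it. The following is a simplified sketch, assuming label values without escaped quotes or closing braces; parse_samples is a hypothetical helper, not part of the exporter or of any Prometheus client library.

```python
import re

# One sample per line: metric_name{label="value",...} number
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples for every parsable sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments carry no sample
        match = _SAMPLE.match(line)
        if match is None:
            continue
        try:
            value = float(match.group("value"))
        except ValueError:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, value))
    return samples


# The sample lines quoted above from the attached endpoint response.
text = '''
nvidia_smi_power_draw_watts{uuid="0e01d9aa-96e8-e62d-bc23-cd9d08891bef"} 54.01
nvidia_smi_power_draw_watts{uuid="2227faca-c5d5-a1a1-7791-e66e8514a41b"} 67.51
nvidia_smi_temperature_gpu{uuid="0e01d9aa-96e8-e62d-bc23-cd9d08891bef"} 43
nvidia_smi_temperature_gpu{uuid="2227faca-c5d5-a1a1-7791-e66e8514a41b"} 42
'''
samples = parse_samples(text)
```

If such a parse yields one sample per GPU for the expected metrics, the exporter side is delivering data and the problem lies further down the pipeline.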

Can you go to the Prometheus UI and check whether the metrics are being scraped?
Also, go to the Grafana settings, check the data sources, and verify that the Prometheus data source is accessible.

I suspect that the Prometheus scrape fails because it takes too long; by default, scrape_timeout is 10 seconds. Could you try increasing this timeout to a large value (5m), restart Prometheus, and see whether it can scrape afterwards? See scrape_timeout here: https://prometheus.io/docs/prometheus/latest/configuration/configuration/
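As a rough sketch, the relevant part of prometheus.yml might look like the fragment below. The job name and the target host are hypothetical; the port matches the --web.listen-address used in this issue. Prometheus rejects a scrape_timeout that is larger than the scrape_interval, so the interval is raised as well.

```yaml
scrape_configs:
  - job_name: nvidia_gpu_exporter        # hypothetical job name
    scrape_interval: 5m                  # must be at least as long as scrape_timeout
    scrape_timeout: 5m                   # the default is 10s
    static_configs:
      - targets: ["gpu-node-1:20127"]    # hypothetical host; port from --web.listen-address
```

A timeout this long is a diagnostic setting rather than a fix: once scraping works, the underlying latency still deserves investigation.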

Thanks a lot. It works now after modifying scrape_timeout.
One thing I'm still unsure about: if I scrape more than 10 nodes with 4 GPUs each at the same time, will the dashboard become unavailable because of timeouts? How can that be resolved? It doesn't make sense to multiply scrape_timeout by 10.

By the way, how can I implement a summary table like the one in dashboard 11074? https://grafana.com/grafana/dashboards/11074

Scraping taking that long is still not normal. I would suspect that a single node or a single GPU has some sort of issue and is delaying the response, but we cannot know for sure. It needs deeper investigation on your side.
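One possible way to start that investigation is to time a small nvidia-smi query for each GPU index separately and look for an outlier. The sketch below is an assumption about how one might do this, not something the exporter provides; time_per_gpu and slowest are hypothetical helpers, and the run parameter only exists so the logic can be tried without GPUs.

```python
import subprocess
import time


def time_per_gpu(indices, run=subprocess.run):
    """Return {gpu_index: seconds} for one small nvidia-smi query per GPU."""
    timings = {}
    for index in indices:
        cmd = [
            "nvidia-smi",
            "-i", str(index),  # restrict the query to a single GPU
            "--query-gpu=uuid,temperature.gpu",
            "--format=csv,noheader",
        ]
        start = time.monotonic()
        run(cmd, capture_output=True, text=True, check=False)
        timings[index] = time.monotonic() - start
    return timings


def slowest(timings):
    """Return the GPU index whose query took longest."""
    return max(timings, key=timings.get)


# Example on a node with 8 GPUs, as in this issue:
#   timings = time_per_gpu(range(8))
#   print(slowest(timings), timings)
```

If one index is consistently much slower than the rest, that GPU is a good candidate for a closer look; if all of them are fast, the delay is more likely elsewhere.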

One tip: check whether you are giving enough compute resources and memory to the exporter, Prometheus, and Grafana. You can increase their resources and try again.

About the table, I am not a Grafana expert. What I would suggest is downloading that dashboard, having a look at how they did it, and maybe adapting it to my dashboard (contributions welcome 🙂).

Thanks a lot. I will try changing the dashboard.
I'm also looking forward to exporter performance optimizations for multi-node setups and a dashboard upgrade.
Great project!

Looking forward to it!

My dashboard focuses on a single GPU, and all panels are designed with that in mind. The dropdown allows only a single selection. I am not sure whether this dashboard would work well with multi-GPU support. Feel free to give it a try, but if it doesn't play well, making a separate dashboard could be a good idea as well.

I'll close this issue since there is nothing actionable at the moment. If you have more specific findings about the latency, or if you improve the dashboard, feel free to open another issue or PR.