/telegraf_nv_export

Ultra-low-overhead NVIDIA GPU telemetry plugin for Telegraf, including memory (VRAM) temperature readings.

Primary language: C++. License: MIT.

nv_export

Building

Requirements: CUDA, CMake, a C++23-capable compiler, and libpci (optional).

mkdir build
cd build
# add -DDRAM_TELEMETRY=NO to the cmake command if your GPU is not yet supported or the exporter can't run as root
cmake ..
make
# sudo cp nv_export /etc/telegraf/

Since Telegraf doesn't run the executable as root, the executable needs capabilities that allow it to read /dev/mem. A plain cp does not carry file capabilities over, so apply them to the copy that Telegraf actually executes:

sudo setcap cap_sys_rawio,cap_dac_override+ep ./nv_export
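To verify that the capabilities took effect, a small startup check can try to open /dev/mem and report the reason if that fails. The following is a minimal sketch, not code from nv_export; the function name try_open_readonly is hypothetical. It only tests the open step, so a later mapping of the device can still be refused by the kernel.

```cpp
#include <cerrno>
#include <cstring>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper (not part of nv_export).
// Tries to open `path` read-only. Returns an empty string on success,
// otherwise the system error text (for example "Permission denied").
std::string try_open_readonly(const std::string& path) {
    const int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) {
        return std::strerror(errno);
    }
    close(fd);
    return {};
}
```

Calling try_open_readonly("/dev/mem") as the telegraf user and logging a non-empty result to stderr makes a missing setcap step easy to spot.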

Telegraf Configuration

# ...
[[inputs.execd]]
    command = ["/etc/telegraf/nv_export"]
    data_format = "influx"
    signal = "none"
# ...
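With data_format = "influx" and signal = "none", Telegraf starts the executable once and parses every line it writes to stdout as InfluxDB line protocol, without signalling it to collect. The sketch below shows what producing one such line could look like. The measurement name nv_gpu, the tag index, and the field names are assumptions made for illustration, not the actual schema emitted by nv_export.

```cpp
#include <cstdint>
#include <string>

// Hypothetical sample layout; names are illustrative only.
struct GpuSample {
    unsigned index;            // GPU index, used as a tag
    unsigned core_temp_c;      // core temperature in degrees Celsius
    unsigned vram_temp_c;      // VRAM temperature in degrees Celsius
    std::int64_t timestamp_ns; // Unix time in nanoseconds
};

// Formats one line as: measurement,tag_set field_set timestamp
// Integer field values carry the 'i' suffix required by line protocol.
std::string to_influx_line(const GpuSample& s) {
    return "nv_gpu,index=" + std::to_string(s.index) +
           " core_temp=" + std::to_string(s.core_temp_c) + "i" +
           ",vram_temp=" + std::to_string(s.vram_temp_c) + "i " +
           std::to_string(s.timestamp_ns);
}
```

A long-running plugin would write each line followed by a newline and flush stdout, so that Telegraf sees the metric immediately instead of waiting for a buffer to fill.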

Notes on VRAM temperature readings

VRAM temperature is read through /dev/mem rather than through NVML. This hack is needed because calling nvmlDeviceGetFieldValues() with NVML_FI_DEV_MEMORY_TEMP returns the error NVML_ERROR_NOT_SUPPORTED.
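The general technique of reading a 32-bit value at a physical address through /dev/mem can be sketched as follows. This is an illustration under stated assumptions, not the nv_export implementation: the helper names split_address and read_u32 are hypothetical, and the register address for a given GPU and the decoding of the raw value into degrees are deliberately left out because this README does not state them.

```cpp
#include <cstdint>
#include <optional>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helpers (not part of nv_export).
struct PageSplit {
    std::uint64_t page_base;      // address rounded down to a page boundary
    std::uint64_t offset_in_page; // remaining offset inside that page
};

// mmap offsets must be page-aligned, so split the address first.
// page_size must be a power of two.
inline PageSplit split_address(std::uint64_t phys_addr, std::uint64_t page_size) {
    const std::uint64_t base = phys_addr & ~(page_size - 1);
    return {base, phys_addr - base};
}

// Maps the page containing phys_addr from `path` (e.g. "/dev/mem")
// and reads one 32-bit word. Returns nullopt on any failure.
inline std::optional<std::uint32_t> read_u32(const char* path, std::uint64_t phys_addr) {
    if (phys_addr % 4 != 0) return std::nullopt; // require an aligned word
    const auto page_size = static_cast<std::uint64_t>(sysconf(_SC_PAGESIZE));
    const PageSplit split = split_address(phys_addr, page_size);

    const int fd = open(path, O_RDONLY);
    if (fd < 0) return std::nullopt;
    void* map = mmap(nullptr, page_size, PROT_READ, MAP_SHARED, fd,
                     static_cast<off_t>(split.page_base));
    close(fd); // the mapping stays valid after the descriptor is closed
    if (map == MAP_FAILED) return std::nullopt;

    // volatile forces a single real load from the mapped register
    const auto* bytes = static_cast<const unsigned char*>(map);
    const std::uint32_t value =
        *reinterpret_cast<const volatile std::uint32_t*>(bytes + split.offset_in_page);
    munmap(map, page_size);
    return value;
}
```

Returning std::optional keeps the failure cases (missing capabilities, a kernel that refuses the mapping) visible to the caller, which can then skip the VRAM field instead of reporting a bogus temperature.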

Credits to olealgoritme/gddr6.

Prerequisites

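  • Kernel boot parameter: iomem=relaxed

Edit /etc/default/grub so that GRUB_CMDLINE_LINUX_DEFAULT contains iomem=relaxed, then regenerate the GRUB configuration and reboot:

sudo vim /etc/default/grub
# in the editor, set:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iomem=relaxed"
sudo update-grub
sudo reboot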
  • Disabling Secure Boot

This can be done in the UEFI/BIOS configuration or using mokutil:

mokutil --disable-validation

Check state with:

$ sudo mokutil --sb
SecureBoot disabled
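After rebooting, the kernel parameter from the first prerequisite can be confirmed by looking at /proc/cmdline. The sketch below shows one way an exporter could perform that check at startup. It is an illustration only, and both function names are hypothetical rather than part of nv_export.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helpers (not part of nv_export).

// Returns true if `param` appears as a whole whitespace-separated
// token in `cmdline`, e.g. param = "iomem=relaxed".
bool cmdline_has_param(const std::string& cmdline, const std::string& param) {
    std::istringstream tokens(cmdline);
    std::string token;
    while (tokens >> token) {
        if (token == param) return true;
    }
    return false;
}

// Checks the command line of the running kernel.
// An unreadable /proc/cmdline yields an empty line and therefore false.
bool running_kernel_has_param(const std::string& param) {
    std::ifstream in("/proc/cmdline");
    std::string line;
    std::getline(in, line);
    return cmdline_has_param(line, param);
}
```

If running_kernel_has_param("iomem=relaxed") returns false, the exporter could print a hint to stderr and continue without VRAM temperature.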

Dependencies

  • libpci-dev
sudo apt install libpci-dev -y

Supported GPUs

  • RTX 4090 (AD102)
  • RTX 4080 Super (AD103)
  • RTX 4080 (AD103)
  • RTX 4070 Ti Super (AD103)
  • RTX 4070 Ti (AD104)
  • RTX 4070 Super (AD104)
  • RTX 4070 (AD104)
  • RTX 3090 Ti (GA102)
  • RTX 3090 (GA102)
  • RTX 3080 Ti (GA102)
  • RTX 3080 (GA102)
  • RTX 3080 LHR (GA102)
  • RTX 3070 (GA104)
  • RTX 3070 LHR (GA104)
  • RTX A2000 (GA106)
  • RTX A4500 (GA102)
  • RTX A5000 (GA102)
  • RTX A6000 (GA102)
  • L4 (AD104)
  • L40 (AD102)
  • L40S (AD102)
  • A10 (GA102)