utkuozdemir/nvidia_gpu_exporter

Exporter not scraping metrics

maheshkolhe1 opened this issue · 3 comments

Describe the bug
The exporter is not able to gather GPU metrics.

Console Output

./nvidia_gpu_exporter --query-field-names="AUTO" --log.level=debug

level=info ts=2022-03-23T00:23:30.344Z caller=main.go:65 msg="Listening on address" address=:9835
level=info ts=2022-03-23T00:23:30.346Z caller=tls_config.go:191 msg="TLS is disabled." http2=false
level=debug ts=2022-03-23T00:23:50.520Z caller=exporter.go:171 error="couldn't parse number from: 2022/03/22 20:23:46.283" query_field_name=timestamp raw_value="2022/03/22 20:23:46.283"
level=debug ts=2022-03-23T00:23:50.520Z caller=exporter.go:171 error="couldn't parse number from: 510.47.03" query_field_name=driver_version raw_value=510.47.03
level=debug ts=2022-03-23T00:23:50.520Z caller=exporter.go:171 error="couldn't parse number from: nvidia a100-pcie-40gb" query_field_name=name raw_value="NVIDIA A100-PCIE-40GB"
level=debug ts=2022-03-23T00:23:50.520Z caller=exporter.go:171 error="couldn't parse number from: gpu-7bdeeff7-f7c6-e13c-f368-227523e670a7" query_field_name=uuid raw_value=GPU-7bdeeff7-f7c6-e13c-f368-227523e670a7
level=debug ts=2022-03-23T00:23:50.520Z caller=exporter.go:171 error="couldn't parse number from: 00000000:17:00.0" query_field_name=pci.bus_id raw_value=00000000:17:00.0

$ nvidia-smi --query-gpu="timestamp,driver_version" --format=csv
timestamp, driver_version
2022/03/22 20:26:27.040, 510.47.03
2022/03/22 20:26:27.040, 510.47.03
2022/03/22 20:26:27.040, 510.47.03
2022/03/22 20:26:27.040, 510.47.03

$ nvidia-smi
Tue Mar 22 20:28:58 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
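Since nvidia-smi itself returns data, a quick sanity check is to scrape the exporter directly and see whether any GPU metrics come through. A rough sketch, assuming the default listen address :9835 from the startup log above and the usual Prometheus /metrics path:

$ curl -s http://localhost:9835/metrics | grep -i nvidia   # port taken from the startup log; /metrics is the conventional exporter path

If nothing GPU-related shows up even though nvidia-smi works from a shell, the problem is likely between the exporter process and nvidia-smi (e.g. permissions) rather than in the driver.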

Atoms commented

Having the same issue with 4xx drivers too...

 # nvidia-smi
Wed Jan 12 21:53:03 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
# nvidia-smi --query-gpu="timestamp,driver_version" --format=csv
timestamp, driver_version
2022/01/12 21:52:37.808, 470.103.01
# nvidia_gpu_exporter --query-field-names="AUTO" --log.level=debug
ts=2022-01-12T21:43:03.837Z caller=main.go:66 level=info msg="Listening on address" address=:9835
ts=2022-01-12T21:43:03.838Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false

ts=2022-01-12T21:43:33.142Z caller=exporter.go:180 level=debug error="couldn't parse number from: 2022/01/12 21:43:33.119" query_field_name=timestamp raw_value="2022/01/12 21:43:33.119"
ts=2022-01-12T21:43:33.142Z caller=exporter.go:180 level=debug error="couldn't parse number from: 470.103.01" query_field_name=driver_version raw_value=470.103.01
ts=2022-01-12T21:43:33.142Z caller=exporter.go:180 level=debug error="couldn't parse number from: gpu-4c1eef4b-a66f-6aad-8a50-c93fe2031827" query_field_name=uuid raw_value=GPU-4c1eef4b-a66f-6aad-8a50-c93fe2031827
ts=2022-01-12T21:43:33.142Z caller=exporter.go:180 level=debug error="couldn't parse number from: 00000000:41:00.0" query_field_name=pci.bus_id raw_value=00000000:41:00.0
ts=2022-01-12T21:43:33.142Z caller=exporter.go:180 level=debug error="couldn't parse number from: [n/a]" query_field_name=driver_model.current raw_value=[N/A]
ts=2022-01-12T21:43:33.142Z caller=exporter.go:180 level=debug error="couldn't parse number from: [n/a]" query_field_name=driver_model.pending raw_value=[N/A]
ts=2022-01-12T21:43:33.142Z caller=exporter.go:180 level=debug error="couldn't parse number from: 94.04.57.00.08" query_field_name=vbios_version raw_value=94.04.57.00.08
ts=2022-01-12T21:43:33.142Z caller=exporter.go:180 level=debug error="couldn't parse number from: g190.0510.00.02" query_field_name=inforom.img raw_value=G190.0510.00.02
ts=2022-01-12T21:43:33.142Z caller=exporter.go:180 level=debug error="couldn't parse number from: [n/a]" query_field_name=inforom.pwr raw_value=[N/A]
ts=2022-01-12T21:43:33.142Z caller=exporter.go:180 level=debug error="couldn't parse number from: [n/a]" query_field_name=gom.current raw_value=[N/A]
ts=2022-01-12T21:43:33.142Z caller=exporter.go:180 level=debug error="couldn't parse number from: [n/a]" query_field_name=gom.pending raw_value=[N/A]
Atoms commented

OK, the issue is that the user running nvidia_gpu_exporter needs to be in the video group and able to execute the nvidia-smi binary.
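
For reference, on a typical systemd install this comes down to adding the service account to the video group and confirming nvidia-smi runs as that account. A sketch, assuming the service runs as a dedicated nvidia_gpu_exporter user (the account and unit names below are examples; adjust them to your setup):

# usermod -aG video nvidia_gpu_exporter      # account name is an example
# sudo -u nvidia_gpu_exporter nvidia-smi -L  # should list the GPUs without errors
# systemctl restart nvidia_gpu_exporter      # unit name is an example; group changes only apply to new processes

Setting SupplementaryGroups=video in the systemd unit is an alternative that grants the group at runtime without changing the account's group membership.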

Thanks for sharing the solution. I will look into adding some logging to make issues like this easier to debug in the future.