nvidia_smi plugin reports an error

Question

nvidia_smi plugin reports an error

Opened this issue 3 months ago · 8 comments

Relevant config.toml

# interval = 15

# exec local command
# e.g. nvidia_smi_command = "nvidia-smi"
nvidia_smi_command = "nvidia-smi"

# exec remote command
# nvidia_smi_command = "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null SSH_USER@SSH_HOST nvidia-smi"

# Comma-separated list of the query fields.
# You can find out possible fields by running `nvidia-smi --help-query-gpu`.
# The value `AUTO` will automatically detect the fields to query.
query_field_names = "AUTO"

# query_timeout sets the query timeout to avoid delaying data collection.
query_timeout = "5s"

Logs from categraf

Sep 19 16:23:52 zj-4090-59 categraf[79833]: 2024/09/19 16:23:52 metrics_agent.go:276: E! failed to init input: local.nvidia_smi error: unexpected query field: vgpu_driver_capability.heterogenous_multivGPU

System info

Ubuntu 22.04

Docker

No response

Steps to reproduce

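1. Enable the nvidia_smi plugin.
2. Monitoring data is reported normally.
3. A GPU fails and drops out. This usually shows up as the nvidia-smi command hanging without returning, or failing with "Unable to determine the device handle for GPU0000:CF:00.0: Unknown Error" and a non-zero return code. Running nvidia-smi -L shows both the healthy GPUs and the failed ones.

categraf then stops reporting GPU metrics altogether, so the alerts stop working.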
...
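The observation in step 3, that nvidia-smi -L still lists healthy and failed GPUs together, could be used to tell them apart. The following is a minimal sketch, not categraf's code. It assumes healthy entries look like `GPU 0: <name> (UUID: <uuid>)` and treats every other non-empty line as a failure message. The names `GPU` and `classifyList` are hypothetical.

```go
package main

import (
	"bufio"
	"regexp"
	"strings"
)

// gpuLine matches an assumed healthy entry such as
// "GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-...)".
var gpuLine = regexp.MustCompile(`^GPU (\d+): (.+?) \(UUID: ([^)]+)\)$`)

// GPU is a hypothetical record for one healthy device.
type GPU struct {
	Index string
	Name  string
	UUID  string
}

// classifyList splits the text printed by `nvidia-smi -L` into healthy
// GPUs and raw failure lines (anything that is not a healthy entry).
func classifyList(output string) (healthy []GPU, failed []string) {
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		if m := gpuLine.FindStringSubmatch(line); m != nil {
			healthy = append(healthy, GPU{Index: m[1], Name: m[2], UUID: m[3]})
			continue
		}
		failed = append(failed, line)
	}
	return healthy, failed
}
```

A collector could report `len(failed)` as a gauge so that a lost GPU produces a value to alert on rather than a gap in the data.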

Expected behavior

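When a GPU drops out, there should be monitoring data that reflects it, and metrics for the other, healthy GPUs should continue to be reported normally.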

Actual behavior

No GPU-related metrics are reported any more.

Additional info

No response

Answer 1 · 2024-09-19T11:34:01.000Z

What about enabling nvidia_timeout?

Answer 2 · 2024-09-20T01:53:02.000Z

This is the error from running nvidia-smi --query-gpu this time; see the part in red.

Answer 3 · 2024-09-20T01:58:45.000Z

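"What about enabling nvidia_timeout?"

Do you mean query_timeout = "5s" in the config? That is already enabled.
When the nvidia-smi command hangs, it stays hung no matter how long the timeout is, even if you wait a year. That case can be set aside: the plugin simply cannot work and has nothing to process.
But in the case from the screenshot I just posted above, the command the plugin calls does return, only with an error, and the plugin apparently cannot handle that either. I see no way around it.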

Answer 4 · 2024-09-20T08:52:10.000Z

That should not happen. After the timeout, the process is killed.

Answer 5 · 2024-09-24T01:35:48.000Z

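"That should not happen. After the timeout, the process is killed."

Killing it does not help. The next query hangs again, gets killed again, and so on in a loop. Since the GPU failed, no monitoring data has been reported at all.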
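The timeout-and-kill loop described here can be sketched as follows. This is an illustration under assumptions, not the plugin's implementation: `QueryResult`, `runQuery`, and `defaultTimeout` are hypothetical names, and `defaultTimeout` only mirrors the `query_timeout = "5s"` setting from the config above. The point of the sketch is that every attempt, including a failed one, yields a value (`Up`) that could be reported.

```go
package main

import (
	"context"
	"errors"
	"os/exec"
	"time"
)

// defaultTimeout mirrors query_timeout = "5s" from the config (assumption).
const defaultTimeout = 5 * time.Second

// QueryResult is a hypothetical summary of one collection attempt.
type QueryResult struct {
	Output   []byte // stdout of the command, if any
	Up       int    // 1 if the command exited successfully, 0 otherwise
	TimedOut bool   // true if the deadline expired before the command ended
	Err      error  // underlying error, nil on success
}

// runQuery runs the command and kills it once timeout has elapsed.
// exec.CommandContext kills the process when the context is done.
func runQuery(timeout time.Duration, name string, args ...string) QueryResult {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	out, err := exec.CommandContext(ctx, name, args...).Output()
	res := QueryResult{Output: out, Err: err}
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		res.TimedOut = true
		return res
	}
	if err == nil {
		res.Up = 1
	}
	return res
}
```

If a collector emitted `Up` and `TimedOut` as gauges on every interval, a hung or failing nvidia-smi would show up as a 0 or 1 rather than as missing series. One caveat: a process blocked in an uninterruptible driver call may not exit promptly even after being killed, in which case this call could still block past the timeout.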

Answer 6 · 2024-09-24T01:36:52.000Z

Then you should fix the fault. The source itself is down, and you expect the collector to fix it for you?

Answer 7 · 2024-09-27T09:08:26.000Z

What I mean is that monitoring cannot collect any information about the fault, so I cannot configure a corresponding alert.

Answer 8 · 2024-09-27T15:01:30.000Z

For the no-data case, you can use a function such as absent.