nvidia_smi plugin error
Opened this issue · 8 comments
Derek-zd commented
Relevant config.toml
# interval = 15
# exec local command
# e.g. nvidia_smi_command = "nvidia-smi"
nvidia_smi_command = "nvidia-smi"
# exec remote command
# nvidia_smi_command = "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null SSH_USER@SSH_HOST nvidia-smi"
# Comma-separated list of the query fields.
# You can find out possible fields by running `nvidia-smi --help-query-gpu`.
# The value `AUTO` will automatically detect the fields to query.
query_field_names = "AUTO"
# query_timeout is used to set the query timeout to avoid delays in data collection.
query_timeout = "5s"
Logs from categraf
Sep 19 16:23:52 zj-4090-59 categraf[79833]: 2024/09/19 16:23:52 metrics_agent.go:276: E! failed to init input: local.nvidia_smi error: unexpected query field: vgpu_driver_capability.heterogenous_multivGPU
System info
Ubuntu 22.04
Docker
No response
Steps to reproduce
1. Enable the nvidia_smi plugin.
2. Monitoring data is reported normally.
3. A GPU fails and drops off the bus. Typically the nvidia-smi command hangs with no output, or fails with `Unable to determine the device handle for GPU 0000:CF:00.0: Unknown Error` (non-zero return code). Running `nvidia-smi -L` still lists both the healthy cards and the faulty one.
- categraf then stops reporting GPU metrics entirely, so the corresponding alerts no longer fire.
...
Expected behavior
When a card drops, metrics for the remaining healthy cards should continue to be reported normally.
Actual behavior
No GPU-related metrics are reported at all.
Additional info
No response
kongfei605 commented
What about enabling nvidia_timeout?
Derek-zd commented
What about enabling nvidia_timeout?
You mean query_timeout = "5s" in the config? That is already enabled.
When the nvidia-smi command hangs, it stays hung past the timeout no matter how long you wait. I can accept that case: the plugin simply cannot work, and there is nothing to be done about it.
But the case in the screenshot I just posted above is different: the command the plugin calls does return, just with an error, and it seems the plugin cannot handle that either. Looks like a dead end.
kongfei605 commented
That shouldn't happen. After the timeout, a kill is issued.
Derek-zd commented
That shouldn't happen. After the timeout, a kill is issued.
Killing it doesn't help: the next query hangs again, gets killed again, and so on in a loop. From the moment the GPU fails, no monitoring data is reported at all.
kongfei605 commented
Then you should fix the fault. The source is down; you expect the collector to repair it for you?
Derek-zd commented
My point is that monitoring cannot collect any fault information, so there is no way to configure a corresponding alert.
kongfei605 commented
For the no-data case you can use functions like absent().
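A minimal sketch of that suggestion as a Prometheus-style alerting rule. The metric name and thresholds here are assumptions for illustration, not taken from categraf's docs; substitute the series your instance actually reports.

```yaml
groups:
  - name: gpu-nodata
    rules:
      # Fires when the GPU metric series disappears entirely, e.g.
      # because nvidia-smi hangs after a card drops.
      # gpu_utilization_gpu is a hypothetical metric name.
      - alert: GpuMetricsAbsent
        expr: absent(gpu_utilization_gpu)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No GPU metrics received for 5 minutes"
```

absent() returns 1 when no series match the inner selector, which is exactly the "collector went silent" condition the reporter wants to alert on.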