NVIDIA/gpu-monitoring-tools

Fan status and card count requirements

guleng opened this issue · 2 comments

I need to check GPU fan speed or fan health status, because hundreds of our fans may have failed without our knowing. In addition, we need a metric for how many GPU cards a machine has, so we can tell whether a card has dropped off, and we need monitoring for whether a card is in a hung or dead state.

According to the DCGM API, I think you can use DCGM_FI_DEV_FAN_SPEED.
Add it to default-counters.csv.
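
For reference, here is a minimal sketch of what the added line might look like. It assumes the three-column layout used by the other entries in that file (DCGM field name, Prometheus metric type, help text). The help text below is illustrative, not copied from the repository:

```
# Assumed format: DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_FAN_SPEED, gauge, Fan speed for the device (in percent).
```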

Thanks for helping @likueimo !
That's pretty much what you need to do :)
More details here: https://github.com/NVIDIA/gpu-monitoring-tools#changing-the-metrics