Fan status and card count requirements
guleng opened this issue · 2 comments
guleng commented
I need to check the GPU fan speed, or at least the fan health status, because hundreds of our fans are damaged and we do not know which ones. In addition, we need a metric for how many GPU cards a machine has, so we can tell whether a card has dropped off, as well as monitoring for whether a card is hung or dead.
likueimo commented
According to the DCGM API, I guess you can use DCGM_FI_DEV_FAN_SPEED. Add it to default-counters.csv.
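As a rough sketch of that step (not an official tool), the snippet below appends the field to a counters file unless it is already listed. It assumes the file uses one `DCGM field, Prometheus metric type, help message` entry per line with `#` comment lines. The helper name `add_counter`, the `gauge` type, and the help text are illustrative assumptions.

```python
from pathlib import Path


def add_counter(csv_path, field, metric_type="gauge", help_text=""):
    """Append a counter line unless `field` is already listed.

    Returns True if a line was added, False if the field was present.
    Assumes the layout `FIELD, type, help message`, with `#` comments.
    """
    path = Path(csv_path)
    text = path.read_text() if path.exists() else ""

    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comments.
        if not stripped or stripped.startswith("#"):
            continue
        # The field name is the first comma-separated column.
        if stripped.split(",")[0].strip() == field:
            return False

    # Make sure the new entry starts on its own line.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a") as fh:
        fh.write(f"{prefix}{field}, {metric_type}, {help_text}\n")
    return True
```

Calling `add_counter("default-counters.csv", "DCGM_FI_DEV_FAN_SPEED", "gauge", "Fan speed.")` would add one line to the file. The exporter probably needs a restart to pick up the change, and whether the fan field reports a value likely depends on the GPU model.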
RenaudWasTaken commented
Thanks for helping, @likueimo!
That's pretty much what you need to do :)
More details here: https://github.com/NVIDIA/gpu-monitoring-tools#changing-the-metrics