Drain reason from sinfo
Closed this issue · 2 comments
This exporter is fantastic, and we're hoping to get a bit more out of it. I've been looking at the code for node status, and I'd really like to track our drain reasons. I think this would help us spot trends.
Where you gather the sinfo output (https://github.com/vpenso/prometheus-slurm-exporter/blob/master/nodes.go#L113),
could you add %E and grab the reason? What else would need to change to print it out properly?
Thanks
Unless you have an extremely determined group of sysadmins who consistently put a proper string into the 'reason' field, you may end up with something like the following (extracted from one of our clusters):
50,idle,none
6,reserved,none
1,drained*,TTS#202010[...]
2,draining,Kill task failed
1,draining,NHC: Watchdog timer unable to terminate hung NHC process 4312.
11,down*,reboot timed out
1,down*,TTS#202010[...]
1,down*,HW problem: node offline
[...]
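To make the concern concrete, here is a minimal Go sketch of how output in this count,state,reason shape could be parsed. It is not the exporter's actual code: the names nodeKey and parseSinfoReasons are hypothetical, and the input is assumed to come from an sinfo format string along the lines of "%D,%T,%E". Every distinct (state, reason) pair ends up as its own map key, which is exactly how each free-form reason would turn into a separate time series.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nodeKey identifies one would-be time series: a node state plus its reason.
// (Hypothetical type, not part of the exporter.)
type nodeKey struct {
	State  string
	Reason string
}

// parseSinfoReasons parses lines of the form "count,state,reason" and sums
// the node counts per (state, reason) pair.
func parseSinfoReasons(out string) (map[nodeKey]int, error) {
	counts := make(map[nodeKey]int)
	for _, line := range strings.Split(out, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		// Split into at most three fields so commas inside the reason survive.
		fields := strings.SplitN(line, ",", 3)
		if len(fields) != 3 {
			return nil, fmt.Errorf("unexpected sinfo line: %q", line)
		}
		n, err := strconv.Atoi(fields[0])
		if err != nil {
			return nil, fmt.Errorf("bad node count in %q: %w", line, err)
		}
		counts[nodeKey{State: fields[1], Reason: fields[2]}] += n
	}
	return counts, nil
}

func main() {
	// A few lines taken from the sample output in this thread.
	sample := "50,idle,none\n2,draining,Kill task failed\n11,down*,reboot timed out\n"
	counts, err := parseSinfoReasons(sample)
	if err != nil {
		panic(err)
	}
	for k, n := range counts {
		fmt.Printf("state=%s reason=%q nodes=%d\n", k.State, k.Reason, n)
	}
}
```

The number of keys in the returned map grows with every new wording of a reason, so it gives a rough estimate of how many series a reason label would add.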
IMHO, adding the reason will increase the number of time series without giving you much benefit.
Unless you let Slurm do the job (e.g. Kill task failed), the risk is ending up with a fragmented view
of the status. Plus, I am very skeptical you'll be able to spot trends from a bunch of strings.
This could be a job better suited to accounting/reporting than to a live dashboard (but in the former case you
will not need an exporter, just the sinfo/sreport utilities from Slurm itself).
I understand your point about not cluttering the metric with random strings, but in our case it would be very useful to tell the difference between a node drained by Slurm and a node drained by an operator, as we do not want to alert on the latter.
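One way to get that distinction without exposing the raw strings would be to collapse each reason into a small, fixed set of labels. The Go sketch below is only an illustration under stated assumptions: drainSource and slurmReasons are hypothetical names, and the list of Slurm-generated reasons contains only "Kill task failed", the one example mentioned in this thread; which other strings Slurm sets on its own would have to be checked per site.

```go
package main

import (
	"fmt"
	"strings"
)

// slurmReasons lists reason prefixes ASSUMED to be set by Slurm itself.
// Only "Kill task failed" comes from this thread; verify and extend per site.
var slurmReasons = []string{"Kill task failed"}

// drainSource maps a free-form reason string to one of three fixed labels,
// "none", "slurm" or "operator", so the label cardinality stays bounded.
func drainSource(reason string) string {
	r := strings.TrimSpace(reason)
	if r == "" || r == "none" {
		return "none"
	}
	for _, prefix := range slurmReasons {
		if strings.HasPrefix(r, prefix) {
			return "slurm"
		}
	}
	// Anything not recognised is treated as set by a human or a site tool.
	return "operator"
}

func main() {
	for _, reason := range []string{"none", "Kill task failed", "HW problem: node offline"} {
		fmt.Printf("%q -> %s\n", reason, drainSource(reason))
	}
}
```

Note that anything not on the list falls into "operator", so reasons written by automated tools (the NHC message in the sample above, for instance) would need their own prefix if they should be alerted on differently.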