Drain reason from sinfo

Question

Closed this issue 4 years ago · 2 comments

This exporter is fantastic, and we're hoping to get a bit more out of it. I've been looking at the code for node status, and I'd really like to track our drain reasons. I think this would help us spot trends.

Where you gather the sinfo output (https://github.com/vpenso/prometheus-slurm-exporter/blob/master/nodes.go#L113),
could you add %E to the format and grab the reason? What else would need to change so that it is printed properly?

Thanks

Answer 1 · 2020-10-12T12:48:09.000Z

Unless you have an extremely determined group of sysadmins who consistently put a proper string into the 'reason' field, you may end up with something like the following (extracted from one of our clusters):

50,idle,none
6,reserved,none
1,drained*,TTS#202010[...]
2,draining,Kill task failed
1,draining,NHC: Watchdog timer unable to terminate hung NHC process 4312.
11,down*,reboot timed out
1,down*,TTS#202010[...]
1,down*,HW problem: node offline
[...]

IMHO, adding the reason will increase the number of time series without giving you much benefit.
Unless you let Slurm set the reason itself (e.g. Kill task failed), you risk ending up with a fragmented view
of the status. Plus, I am very skeptical that you'll be able to spot trends in a bunch of free-text strings.
This could be a job better suited to accounting/reporting than to a live dashboard (but in the former case you
will not need an exporter, just the sinfo/sreport utilities from Slurm itself).
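
To make the cardinality concern concrete, here is a minimal Go sketch (not the exporter's actual code) that parses lines in the count,state,reason shape shown above, which is what a format string such as `sinfo -h -o "%D,%T,%E"` would be expected to produce, and compares how many distinct label sets result with and without the reason. The `nodeGroup` type and both function names are hypothetical, and the sample lines are copied from the listing above, elisions included.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nodeGroup is a hypothetical record for one "count,state,reason" line.
type nodeGroup struct {
	Count  int
	State  string
	Reason string
}

// parseSinfoLine splits one line into its three fields.
// SplitN with a limit of 3 keeps any commas inside the free-text reason intact.
func parseSinfoLine(line string) (nodeGroup, error) {
	parts := strings.SplitN(strings.TrimSpace(line), ",", 3)
	if len(parts) != 3 {
		return nodeGroup{}, fmt.Errorf("expected 3 fields, got %d in %q", len(parts), line)
	}
	n, err := strconv.Atoi(parts[0])
	if err != nil {
		return nodeGroup{}, fmt.Errorf("bad node count in %q: %w", line, err)
	}
	return nodeGroup{Count: n, State: parts[1], Reason: parts[2]}, nil
}

// distinctSeries returns the number of distinct (state, reason) pairs and the
// number of distinct states, i.e. how many series each labelling would create.
func distinctSeries(lines []string) (int, int, error) {
	pairs := map[[2]string]struct{}{}
	states := map[string]struct{}{}
	for _, line := range lines {
		if strings.TrimSpace(line) == "" {
			continue // skip blank lines
		}
		g, err := parseSinfoLine(line)
		if err != nil {
			return 0, 0, err
		}
		pairs[[2]string{g.State, g.Reason}] = struct{}{}
		states[g.State] = struct{}{}
	}
	return len(pairs), len(states), nil
}

func main() {
	sample := []string{
		"50,idle,none",
		"6,reserved,none",
		"1,drained*,TTS#202010[...]",
		"2,draining,Kill task failed",
		"1,draining,NHC: Watchdog timer unable to terminate hung NHC process 4312.",
		"11,down*,reboot timed out",
		"1,down*,TTS#202010[...]",
		"1,down*,HW problem: node offline",
	}
	withReason, stateOnly, err := distinctSeries(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// prints: 8 series with reason, 5 with state only
	fmt.Printf("%d series with reason, %d with state only\n", withReason, stateOnly)
}
```

Even on this short excerpt the reason label raises the series count from 5 to 8, and every new ticket number or process ID in a reason string would add another.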

Answer 2 · 2024-08-09T07:57:28.000Z

I understand your point about not cluttering the metric with random strings, but in our case it would be very useful to tell the difference between a node being drained by Slurm and a node being drained by an operator, as we do not want to alert on the latter.
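
One way to get that distinction without exporting raw strings might be to collapse the reason into a small fixed set of values before it becomes a label. The following Go sketch is only an illustration under stated assumptions: `reasonSource` and `slurmReasons` are hypothetical names, they are not part of the exporter, and the only Slurm-set reason taken from this thread is "Kill task failed". Any further entries would have to be checked against what your Slurm version actually writes.

```go
package main

import (
	"fmt"
	"strings"
)

// slurmReasons is a hypothetical, site-maintained list of reason prefixes that
// Slurm sets on its own. Only "Kill task failed" comes from this thread.
var slurmReasons = []string{
	"Kill task failed",
}

// reasonSource maps a free-text reason to one of three values
// ("none", "slurm", "operator"), so the label stays low-cardinality.
func reasonSource(reason string) string {
	r := strings.TrimSpace(reason)
	if r == "" || r == "none" {
		return "none"
	}
	for _, prefix := range slurmReasons {
		if strings.HasPrefix(r, prefix) {
			return "slurm"
		}
	}
	// Assumption: anything not recognised as Slurm-set was entered by a person.
	return "operator"
}

func main() {
	for _, r := range []string{"none", "Kill task failed", "HW problem: node offline"} {
		fmt.Printf("%q -> %s\n", r, reasonSource(r))
	}
}
```

With a label like this, an alert rule could match only the "slurm" value and ignore planned operator drains, while the number of series stays bounded regardless of what people type into the reason field.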