vpenso/prometheus-slurm-exporter

Exporter dies when Slurm accounting not enabled

Closed this issue · 5 comments

Hey,

We're using this to monitor a small Slurm cluster, and it's very useful, thanks! However, we've hit an issue after recently upgrading to 0.17.

In ParseAllocatedGPUs(), sacct is executed to get some data. We don't use Slurm accounting, so the subprocess exits with code 1 to show failure. Execute() receives the non-zero code, and considers this fatal, killing the entire exporter.

I'm happy to attempt a fix myself, but do you have any suggestions for a good logic flow in this case?

Perhaps something like an optional argument to Execute() that designates "allowable" exit codes; meaning blank data is returned and execution continues.
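
To make the "allowable exit codes" idea concrete, here is a rough sketch of what such a wrapper could look like. The exporter's real Execute() signature is not shown in this thread, so the name executeAllowing and its parameters are hypothetical; the only standard-library calls used are exec.Command, Cmd.Output, and ExitError.ExitCode.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// executeAllowing is a hypothetical stand-in for the exporter's Execute().
// It runs a command and returns its stdout. If the command exits with one of
// the codes listed in allowed, it returns empty data and no error, so the
// caller can carry on instead of terminating the exporter.
func executeAllowing(allowed []int, name string, args ...string) ([]byte, error) {
	out, err := exec.Command(name, args...).Output()
	if err == nil {
		return out, nil
	}
	// A non-zero exit status is reported as *exec.ExitError.
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		code := exitErr.ExitCode()
		for _, a := range allowed {
			if code == a {
				return nil, nil // tolerated failure: blank data
			}
		}
	}
	// Anything else (missing binary, unexpected exit code) is still an error.
	return nil, fmt.Errorf("%s failed: %w", name, err)
}
```

With something like this, ParseAllocatedGPUs() could call executeAllowing([]int{1}, "sacct", ...) and treat an empty result as "no GPU data" when accounting is disabled. Whether blank data or a logged warning is the better behaviour is a design choice for the maintainers.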

Rovanion commented

I'm not a Go programmer, so I don't know how to implement it, but I would suggest adding a command-line flag --disable-accounting that, when passed, disables all submodules that depend on calls to sacct. You can find them by running git grep sacct.

In my PR #43 I have added uniform error reporting for failing commands which you may find useful.

mtds commented

@Lobstros : take a look at the gpus_acct branch.

With this version, GPU accounting is turned off by default. It has to be explicitly enabled by passing the option -gpus-acct=true on the command line when launching the exporter; otherwise this feature is excluded.
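
As an illustration of the opt-in behaviour described above, the following is a minimal sketch of gating a sacct-dependent collector behind a boolean flag. Only the flag name gpus-acct comes from this thread. The helper names and the collector names are assumptions made for the example, not the exporter's actual code.

```go
package main

import "flag"

// parseGPUsAcct parses the given arguments with a private FlagSet and reports
// whether GPU accounting was requested. The flag defaults to false, so the
// sacct-dependent code path is skipped unless -gpus-acct=true is passed.
func parseGPUsAcct(args []string) (bool, error) {
	fs := flag.NewFlagSet("exporter", flag.ContinueOnError)
	gpusAcct := fs.Bool("gpus-acct", false, "enable GPU accounting (requires sacct)")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *gpusAcct, nil
}

// enabledCollectors returns the names of the collectors to register.
// The names are hypothetical; "gpus" stands for the collector that calls sacct.
func enabledCollectors(gpusAcct bool) []string {
	names := []string{"nodes", "queue", "scheduler"}
	if gpusAcct {
		names = append(names, "gpus")
	}
	return names
}
```

On a cluster without Slurm accounting, starting the exporter without the flag means the "gpus" collector is never registered, so sacct is never invoked.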

mtds commented

@Lobstros : with version 0.19 of this exporter, GPU accounting is off by default.
@Rovanion : thanks for your PR, but since it added a lot of changes, we have not had the time to review it. Nevertheless, rest assured that it is on our radar.

Great news, thanks. Apologies for the wait; I'd intended to test it earlier. I have just installed 0.19 and it works fine.

Thank you @mtds!