vpenso/prometheus-slurm-exporter

Slurm exporter crashes on Slurm 20.11.8

Opened this issue · 15 comments

When I try to run curl http://localhost:8080/metrics on the latest build of the exporter, I see the following error message. Is there a fix for this?

panic: runtime error: index out of range [4] with length 4

goroutine 12 [running]:
main.ParseNodeMetrics(0xc0003c6000, 0x1f9, 0x600, 0x0)
/opt/prometheus-slurm-exporter/node.go:56 +0x6d6
main.NodeGetMetrics(0x0)
/opt/prometheus-slurm-exporter/node.go:40 +0x2a
main.(*NodeCollector).Collect(0xc0000ab710, 0xc0001a2660)
/opt/prometheus-slurm-exporter/node.go:128 +0x37
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
/root/go/pkg/mod/github.com/prometheus/client_golang@v1.2.1/prometheus/registry.go:443 +0x12b
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
/root/go/pkg/mod/github.com/prometheus/client_golang@v1.2.1/prometheus/registry.go:454 +0x5ce

Are you using -gpus-acct? I found that gpus.go needs to be updated.

- --format=Allocgres
+ --format=AllocTRES

Here's the error if you run the sacct command as is:

> sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2
sacct: fatal: AllocGRES is deprecated, please use AllocTRES

If you are using --gpus-acct try #73.

Hi all,

I am having the same issue. I tried both solutions (@wmoore28 and @itzsimpl).
My slurm version is 21.08.5

Does anyone have any workaround for the panic array length?
The process/service is killed with this panic.

PS.: The got test *.go runs fine for all test

Thanks

Reviewing the error listed in the first post this has nothing to do with gpus.go, but with node.go. Try running the command sinfo -h -N -O NodeList,AllocMem,Memory,CPUsState,StateLong on your system and examine the output. This is what node.go executes and then parses. Compare your output with what is in sinfo_mem.txt.

This fixed it in my environment. I can send PR if it fixes the issue globally
DImuthuUpe@3adffd8

Hi All,

Thanks for the quick reply.
The output of the command: sinfo -h -N -O NodeList,AllocMem,Memory,CPUsState,StateLong
Just some entries, because we have hundreds instances in cloud as well:

Please, consider the space rendered as a 'tab'

cloudsmp-r548-40 0 756878 0/48/0/48 idle~
cloudsmp-r548-41 0 756878 0/48/0/48 idle~
cloudsmp-r548-42 0 756878 0/48/0/48 idle~

node02 386016 386385 48/0/0/48 allocated
node03 386000 386385 48/0/0/48 allocated
node03 386000 386385 48/0/0/48 allocated
node03 386000 386385 48/0/0/48 allocated

Comparing with sinfo_mem.txt it seems to be equal (5 columns)

Second test was run the @DImuthuUpe command:
sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong:"

The difference from the previous command it is a single space instead of a tab

cloudsmp-r548-37 0 756878 0/48/0/48 idle~
cloudsmp-r548-38 0 756878 0/48/0/48 idle~
cloudsmp-r548-39 0 756878 0/48/0/48 idle~
cloudsmp-r548-40 0 756878 0/48/0/48 idle~

node01 135744 386385 37/11/0/48 mixed
node01 135744 386385 37/11/0/48 mixed
node01 135744 386385 37/11/0/48 mixed

I have updated the node.go as suggested by @DImuthuUpe, compiled and run in foreground:
bin/prometheus-slurm-exporter -gpus-acct
INFO[0000] Starting Server: :8080 source="main.go:59"
INFO[0000] GPUs Accounting: true source="main.go:60"
FATA[0026] exit status 1 source="gpus.go:101"

Then I amended the sacct command as suggested by @wmoore28 and it worked fine.

In resume, there are two changes::
1 - nodes.go as suggested by @DImuthuUpe
2 - gpus.go for who use -gpus-acct parameter as suggested by @wmoore28

A good improvement could be an automated test to sacct command (in this case for gpus):
sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2
sacct: fatal: AllocGRES is deprecated, please use AllocTRES

Thanks everyone for the quick help.
I will keep an eye on new versions available here with fixes applied.
I am pretty sure @vpenso will find a way to keep back compatibility in slurm versions.

Thanks again!

@JaderGiacon I have integrated the fix from @DImuthuUpe, and also patched gpus.go as there was an issue when gres does not use gpuType. Could you perhaps test if it works for you?

Thanks, @itzsimpl for integrating the fix

Hi @itzsimpl,

I have compiled the version present in your github (https://github.com/itzsimpl/prometheus-slurm-exporter) and it worked fine.
The process is not being killed and the gpus information is coming well. Also, the go test worked fine.

Thank you so much for it!

mtds commented

@ALL : I have merged the updated PR #73 into the development branch

The master branch will be kept backward compatible to Slurm version (up to) 18.x.

Hello!
Are there any news and plans when this fix from the development branch will be released on the master/main branch?

@Sebi94nbg -- @vpenso and @mtds seem to have gone off line. Another bug fix about 20+ character node names with PR that was offered in MAY has also not been merged into main. 🤷

mtds commented

Hello! Are there any news and plans when this fix from the development branch will be released on the master/main branch?

Possibly choosing the name development for the other branch was not the right one. The PR #73 was not merged
into master because we would like to keep backward compatible for the version of Slurm we are still using (18.x)
and this PR is not meant for old versions.

Additional PRs, which requires new and/or updated functionalities of Slurm will be merged only into the development
branch. If you are using newer version of Slurm (from 20.x onwards) our suggestion is to skip the master branch
and only use development.

Thanks for your feedback and information.

It's unfortunately not enough to only have a proper branch name here.

We need a tagged release, so that we can pick a specific version of this project to ensure, that the exporter is always the same version. Simply using the branch could lead to different installations and behaviours, when someone pushes / merges changes into this branch.

That's the reason, why we currently for example use this branch, but define a specific commit to avoid different version deployments.

Currently, you're simply incrementing your release version. What about these release tags?

  • Releases 0.x => Slurm 18.x
  • Releases 1.x => Slurm 20.x

Then you would keep the backward compatibility and users could decide, which version / release they need or want.

In addition, you may also want to rename the development branch to something like slurm-20.x.

imoes commented

I have still the same Problem with slurm 22.05.5.1. The node exporter crashed after few moments.