vpenso/prometheus-slurm-exporter

Nested accounts missing from fairshare

Xaraxia opened this issue · 2 comments

Hi,

We have a nested account arrangement, and those accounts aren't properly being reported on.

I dug into the code, and the command is:

$ sshare -n -P -o account,fairshare
root|0.500000
 top_1|0.999998
  nested_1_1|0.999998
  nested_1_2|1.000000
   nested_1_2_1|1.000000
 top_2|0.481723
  nested_2_1|0.858038
   nested_2_2|0.961831

However when I get the metrics, I only get root, top_1 and top_2.

'root' isn't useful. Top-level accounts are useful as an aggregate, but I'd also like to see the nested accounts.

Ideally, we would have "slurm_account_fairshare" as it is, and also offer "slurm_subaccount_fairshare" so that I could graph both.

Looks like ParseFairShareMetrics() is the culprit, throwing away anything that starts with more than one space.

    if !strings.HasPrefix(line, "  ") {

I can see the argument for doing it, hence my proposal to gather two sets of metrics.

This is what is actually coming out of the exporter:

slurm_account_fairshare{account="top_1"} 0.999998
slurm_account_fairshare{account="root"} 1
slurm_account_fairshare{account="top_2"} 0.481723

So perhaps the right answer is to emit something like:

slurm_account_fairshare{account="root"} 1
slurm_account_fairshare{account="top_1", parent_account="root", account_depth="1"} 0.999998
slurm_account_fairshare{account="nested_1_2", parent_account="top_1", account_depth="2"} 1.000000
slurm_account_fairshare{account="nested_1_2_1", parent_account="nested_1_2", account_depth="3"} 1.000000
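A rough sketch of how the parent and depth labels above could be derived from the sshare indentation. This is not the exporter's current code: the `accountShare` type and `parseNestedFairShare` function are hypothetical names, and it assumes sshare indents each nesting level by exactly one space, as in the output pasted earlier.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// accountShare is a hypothetical record for one line of sshare output.
type accountShare struct {
	Account   string
	Parent    string // empty for the root account
	Depth     int    // number of leading spaces on the sshare line
	Fairshare float64
}

// parseNestedFairShare reads `sshare -n -P -o account,fairshare` output and
// uses the leading-space indentation to recover each account's depth and parent.
func parseNestedFairShare(out string) []accountShare {
	var shares []accountShare
	var ancestors []string // ancestors[d] is the most recent account seen at depth d

	for _, line := range strings.Split(out, "\n") {
		fields := strings.Split(line, "|")
		if len(fields) < 2 {
			continue // blank or malformed line
		}
		name := strings.TrimLeft(fields[0], " ")
		depth := len(fields[0]) - len(name)
		name = strings.TrimSpace(name)
		if name == "" {
			continue
		}
		if depth > len(ancestors) {
			depth = len(ancestors) // guard against a skipped indentation level
		}
		parent := ""
		if depth > 0 {
			parent = ancestors[depth-1]
		}
		// Record this account as the current ancestor at its depth,
		// dropping any deeper ancestors left over from a previous branch.
		ancestors = append(ancestors[:depth], name)

		value, err := strconv.ParseFloat(strings.TrimSpace(fields[1]), 64)
		if err != nil {
			continue // e.g. an empty fairshare column
		}
		shares = append(shares, accountShare{name, parent, depth, value})
	}
	return shares
}

func main() {
	out := "root|0.500000\n top_1|0.999998\n  nested_1_1|0.999998\n" +
		"  nested_1_2|1.000000\n   nested_1_2_1|1.000000\n top_2|0.481723\n"
	for _, s := range parseNestedFairShare(out) {
		fmt.Printf("slurm_account_fairshare{account=%q, parent_account=%q, account_depth=\"%d\"} %g\n",
			s.Account, s.Parent, s.Depth, s.Fairshare)
	}
}
```

Keeping a single metric name with extra labels means existing dashboards that select on `account` keep working, while `account_depth="1"` can be used to get the old top-level-only view.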

I'm happy to cut some code to do this if you can give me some recommendations.

Tangentially related, but noting it here in case anyone else arrives looking for it as I did. I was looking into something similar, where fairshare metrics were missing from all accounts. When the fair tree fairshare algorithm is used (the default in Slurm 19.05+), sshare makes no attempt to calculate a fairshare value for anything other than users. For accounts, a (double)NO_VAL64 is hardcoded, and this appears to be rendered as a blank: https://github.com/SchedMD/slurm/blob/master/src/sshare/process.c#L261

This manifests as the exporter reporting 0 for all accounts. We considered patching the exporter to report LevelFS instead, which sshare does produce for accounts, but we are not sure how best to deal with infinity.
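On the infinity question, one possible approach is to pass it straight through, since the Prometheus text format can represent +Inf as a sample value. A minimal sketch, assuming the LevelFS column shows infinity as the text `inf` (worth checking against your sshare version); `parseLevelFS` is a hypothetical helper, not something in the exporter today.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// parseLevelFS converts one LevelFS field into a float64.
// The second return value is false when the field is blank or unparsable,
// so the caller can skip the sample instead of exporting a misleading 0.
func parseLevelFS(field string) (float64, bool) {
	field = strings.TrimSpace(field)
	if field == "" {
		return 0, false
	}
	// strconv.ParseFloat accepts "inf" and "infinity" (case-insensitive,
	// optionally signed) and returns the matching IEEE infinity.
	v, err := strconv.ParseFloat(field, 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

func main() {
	for _, field := range []string{"0.858038", "inf", ""} {
		v, ok := parseLevelFS(field)
		switch {
		case !ok:
			fmt.Printf("%q: skipped\n", field)
		case math.IsInf(v, 1):
			fmt.Printf("%q: +Inf\n", field)
		default:
			fmt.Printf("%q: %g\n", field, v)
		}
	}
}
```

The trade-off is that a +Inf sample will poison aggregations such as `avg()` over accounts, so skipping those samples, or exporting them under a separate metric, may be the friendlier option for dashboards.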