rivosinc/prometheus-slurm-exporter

Enhancement Request: Add per-partition node state metrics


Hi - really appreciate the effort you've put into this project - great work so far!

Would it be possible to add support for per-partition node state metrics, so that for partition X you'd see a count of nodes in DRAIN/DOWN/ALLOC/etc.? This would be really useful for tracking the relative usage of partitions in a heterogeneous cluster. This would be similar to tracking the output of a command like:
sinfo -h -o %D,%T -p <partition>
There may be some trickiness in deciding how to track node state for non-responsive nodes, as sinfo produces separate lines for them, e.g.

10,drained*
2,drained

showing that 10 nodes are drained and unresponsive, and 2 are drained but responsive. Ideally the total number of nodes in the partition would be recorded, i.e. the metric would track all nodes in the drained state, not just the responsive ones (cf. the -r option to sinfo, which limits output to responding nodes).
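
For illustration, here is a minimal Go sketch of deriving those counts, assuming sinfo is on PATH; the function name sinfoStateCounts and the "hw-a" partition are hypothetical, not anything from the exporter:

package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// sinfoStateCounts (hypothetical) parses `sinfo -h -o %D,%T -p <partition>`
// output, one "<count>,<state>" pair per line, into a state -> node count map.
func sinfoStateCounts(partition string) (map[string]int, error) {
	out, err := exec.Command("sinfo", "-h", "-o", "%D,%T", "-p", partition).Output()
	if err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		parts := strings.SplitN(line, ",", 2)
		if len(parts) != 2 {
			continue
		}
		n, err := strconv.Atoi(strings.TrimSpace(parts[0]))
		if err != nil {
			continue
		}
		// "drained" and "drained*" (non-responding) remain distinct keys,
		// so unresponsive nodes still count toward the partition total.
		counts[strings.TrimSpace(parts[1])] += n
	}
	return counts, nil
}

func main() {
	counts, err := sinfoStateCounts("hw-a")
	if err != nil {
		panic(err)
	}
	fmt.Println(counts) // e.g. map[drained:2 drained*:10]
}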

Howdy! Appreciate the detailed feature request. I definitely think this could be useful. It would seem to be a pretty simple addition to the following function, plus of course adding the metric description and collection. I do not believe a node gets replicated across multiple states. After doing a couple of experiments and checking the docs, I anticipate the following cardinality from the command line: # of nodes * # of partitions per node. So this should be as simple as:

func fetchNodePartitionMetrics(nodes []NodeMetric) map[string]*PartitionMetric {
	partitions := make(map[string]*PartitionMetric)
	for _, node := range nodes {
		for _, p := range node.Partitions {
			partition, ok := partitions[p]
			if !ok {
				partition = &PartitionMetric{
					StateAllocMemory: make(map[string]float64),
					StateAllocCpus:   make(map[string]float64),
					// new metric here
					StateNodeCount: make(map[string]float64),
				}
				partitions[p] = partition
			}
			partition.StateAllocCpus[node.State] += node.AllocCpus
			partition.StateAllocMemory[node.State] += node.AllocMemory
			partition.StateNodeCount[node.State] += 1
			// ...
		}
	}
	return partitions
}
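
To go with that, here's a minimal sketch of the metric description and collection side, assuming the exporter uses client_golang and using the metric name proposed just below; the helper name emitPartitionNodeCounts is illustrative, not the exporter's actual code:

import "github.com/prometheus/client_golang/prometheus"

var partitionStateNodeCount = prometheus.NewDesc(
	"slurm_partition_state_node_count",
	"Count of nodes per partition and node state",
	[]string{"partition", "state"},
	nil,
)

// emitPartitionNodeCounts (hypothetical) would run inside the collector's
// Collect method, emitting one gauge sample per (partition, state) pair.
func emitPartitionNodeCounts(ch chan<- prometheus.Metric, partitions map[string]*PartitionMetric) {
	for name, pm := range partitions {
		for state, count := range pm.StateNodeCount {
			ch <- prometheus.MustNewConstMetric(
				partitionStateNodeCount,
				prometheus.GaugeValue,
				count,
				name, state, // label values in the Desc's label order
			)
		}
	}
}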

Let's call our new metric slurm_partition_state_node_count for now. Let's say our cluster has 12 nodes in the states you described, and all of them are part of 2 partitions (hw-a and hw-b); we should expect the exporter to emit the following metrics on collect:

slurm_partition_state_node_count{partition="hw-a",state="drained*"} 10
slurm_partition_state_node_count{partition="hw-a",state="drained"} 2
slurm_partition_state_node_count{partition="hw-b",state="drained*"} 10
slurm_partition_state_node_count{partition="hw-b",state="drained"} 2

With the above modification and a few other changes, I believe that's what the exporter will output.
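
As a side note, per-partition totals should then fall out of a simple aggregation on the Prometheus side, e.g. sum by (partition) (slurm_partition_state_node_count), and a matcher like state=~"drained.*" would fold responsive and unresponsive drained nodes together.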

Let me know if you'd like to give this a go. Otherwise, I can get this in within the next couple of days/weeks. :)

EDIT: fixed code snippet

Wow! That's a fast response! Yes, let me take a look and confirm, but it looks like exactly what's needed. And just to add: yes, I'd be very happy to test this out once ready.
I might not be following the logic correctly here, but should the line in the snippet above read:

    partition.StateNodeCount[node.State] += 1

rather than

    partition.StateAllocMemory[node.State] += 1

?

Good catch, thanks. I'll update the thread once this is done :)

@AndyMcVey Let me know how #87 is.