NAICNO/Jobanalyzer

Core bindings on each node (betzy)


Can we extract the momentary state per core in binary form (a bitvector): high vs. low usage at that moment? top/htop can do this, so we should be able to as well. It's OK to look only when we sample and not worry about history. For intuition, think about the htop graph and how the activity is distributed.
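
To make the bitvector idea concrete, here is a minimal sketch. It assumes we have two readings of cumulative per-core busy time (for example in clock ticks, in the style of /proc/stat) taken one sampling interval apart. The function names, the tick-based input, and the 50% threshold are all hypothetical, chosen for illustration; this is not what Sonar actually does.

```python
from typing import Sequence


def busy_bitvector(prev_ticks: Sequence[int],
                   curr_ticks: Sequence[int],
                   interval_ticks: int,
                   threshold: float = 0.5) -> int:
    """Return an int whose bit i is 1 if core i was busy for more than
    `threshold` of the interval, else 0.

    prev_ticks/curr_ticks are cumulative busy-time counters per core
    (hypothetical input format); interval_ticks is the interval length.
    """
    if len(prev_ticks) != len(curr_ticks):
        raise ValueError("core counts differ between the two readings")
    if interval_ticks <= 0:
        raise ValueError("interval_ticks must be positive")
    bits = 0
    for core, (before, after) in enumerate(zip(prev_ticks, curr_ticks)):
        used = max(after - before, 0)  # guard against counter resets
        if used / interval_ticks > threshold:
            bits |= 1 << core
    return bits


def format_bits(bits: int, ncores: int) -> str:
    """Render the bitvector as a string, core 0 first."""
    return "".join("1" if (bits >> i) & 1 else "0" for i in range(ncores))
```

With four cores where cores 0 and 2 are busy, `format_bits` gives a string such as `1010`, which is roughly the htop picture reduced to one bit per core.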

The rationale is that users get the core bindings completely wrong sometimes and we'd like to analyze patterns and generate alerts.

Sonar can now generate these data (NordicHPC/sonar#179), so how do we extract and manipulate them? I think we should see the load data as separate from the sample data; they just happen to be encoded in the same stream. But it's a bit open what parse ought to do: should it correlate the sample stream and the load data stream and insert the latter into the former to recreate the original data stream? Possibly. Or should there be some kind of cross-reference from a (raw) sample record to a (raw) load data record?
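
The cross-reference option could look roughly like the sketch below. It assumes both kinds of record carry a host name and a timestamp, and that the pair identifies one sampling event. The record types and field names (`Sample`, `LoadDatum`, `encoded`, and so on) are hypothetical and do not reflect the real Sonar or sonalyze record layout.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Sample:
    # Hypothetical fields, for illustration only.
    host: str
    timestamp: str
    job_id: int
    cpu_pct: float


@dataclass(frozen=True)
class LoadDatum:
    # One per host per sampling event; `encoded` is an opaque per-core payload.
    host: str
    timestamp: str
    encoded: str


Key = Tuple[str, str]


def index_load(load: Iterable[LoadDatum]) -> Dict[Key, LoadDatum]:
    """Index the load stream by (host, timestamp), kept apart from samples."""
    return {(d.host, d.timestamp): d for d in load}


def load_for(sample: Sample, index: Dict[Key, LoadDatum]) -> Optional[LoadDatum]:
    """Cross-reference: find the load record for a sample, if there is one."""
    return index.get((sample.host, sample.timestamp))
```

The two streams stay separate here and the lookup happens on demand, so nothing has to be copied into every sample record. Recreating the original combined stream would instead mean attaching the result of `load_for` to each sample.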

There will probably be a new command, sonalyze top, which will extract and display the load data for a set of nodes, allowing the core bindings to be analyzed. This doesn't have to be very involved: one useful output would probably be, say, two bits per CPU per time step, indicating how busy it was on a log scale during the last time step. Ad-hoc reporting code would request these data and then process them further and report on them. Another useful output might be the amount of CPU time (percent or absolute) used by a core during the same time step: the same information, but maybe with different utility.
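
As a sketch of the two-bits-per-CPU output: assume we already know, per core, the fraction of the last time step it was busy. The power-of-two cutoffs (12.5%, 25%, 50%) are one possible reading of "log scale" and are an assumption, as are the function names; the issue does not fix either.

```python
from typing import List, Sequence


def busy_level(fraction: float) -> int:
    """Map a busy fraction in [0, 1] to a 2-bit level on a log2-style scale.

    Cutoffs are an assumption: 3 = at least 50%, 2 = at least 25%,
    1 = at least 12.5%, 0 = anything lower.
    """
    if fraction >= 0.5:
        return 3
    if fraction >= 0.25:
        return 2
    if fraction >= 0.125:
        return 1
    return 0


def pack_levels(fractions: Sequence[float]) -> int:
    """Pack one 2-bit level per core into an int, core 0 in the lowest bits."""
    packed = 0
    for core, fraction in enumerate(fractions):
        packed |= busy_level(fraction) << (2 * core)
    return packed


def unpack_levels(packed: int, ncores: int) -> List[int]:
    """Recover the per-core levels from a packed value."""
    return [(packed >> (2 * core)) & 0b11 for core in range(ncores)]
```

Reporting code could call `unpack_levels` on each time step for a node and look for patterns such as a few cores at level 3 while the rest sit at 0. The percent output would just be the same fractions, unquantized.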

Representation-wise, the database delivers the sample stream and the load stream separately, and they are postprocessed and managed separately (and for all I know, maybe we want APIs that deliver one or the other but not both - it makes sense now that we have caching).

Sonalyze can now process these data and print them (in human-readable format only; CSV etc. are coming).