NAICNO/Jobanalyzer

sacct logging, ingestion, storage, query, and analysis

Closed this issue · 4 comments

This is for the first step of #66. sacct data are just the results of querying the slurmdb. We can use sacct instead of the slurm REST service (which is not installed on fox and might not be on other systems). The idea is to log sacct data on one supercomputer node every so often (probably hourly, with some data overlap), exfiltrate it to jobanalyzer, and then run queries against this data and the sonar data in the normal way.
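To make the "hourly, with some data overlap" idea concrete, here is a minimal sketch of how a sampling run could compute its time window and build a sacct command line. The function names, the 30-minute overlap, and the field list are assumptions for illustration, not what code/sacctd actually does; only the sacct options themselves are standard.

```python
from datetime import datetime, timedelta

# Hypothetical field list; the issue only names Partition and AllocTRES explicitly.
FIELDS = ["JobID", "User", "State", "Start", "End", "Partition", "AllocTRES"]


def sacct_window(now, period=timedelta(hours=1), overlap=timedelta(minutes=30)):
    """Return (start, end) for one run, reaching back past the previous run.

    The overlap (an assumed value) means a job that changed state close to the
    previous run's cutoff is reported again, at the cost of duplicate records.
    """
    return now - period - overlap, now


def sacct_command(start, end, fields=FIELDS):
    """Build the argument list for sacct without executing it."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    return [
        "sacct",
        "--allusers",      # all users' jobs, not just the caller's
        "--noheader",      # data lines only
        "--parsable2",     # '|'-delimited, no trailing delimiter
        "--starttime", start.strftime(fmt),
        "--endtime", end.strftime(fmt),
        "--format", ",".join(fields),
    ]


start, end = sacct_window(datetime(2024, 6, 1, 12, 0, 0))
cmd = sacct_command(start, end)
```

Because consecutive windows overlap, the same job can show up in more than one sample, so the ingest or query side has to tolerate duplicates.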

  • Determine how we can use sacct data and which data to use
  • Write code to extract this data (this is the code/sacctd directory)
  • Write code to ingest and store this data (additions to sonalyze add and the sonalyze db layer)
  • Write code to read these data directly, for testing (additions to sonalyze db layer)
  • Prototype deployment on naic-monitor and a fox node (separately from production data pipeline)
  • Write code to run queries against the data, including cleaning the data up (postprocessing, query verbs, filtering)
  • Write code that runs those queries and massages the data (prototype report)

Must have before deploy:

  • Maybe a few more data fields (but we can do a lot with what we have): Partition, AllocTRES (see later)
  • Make sure all filtering is properly implemented; it's been pretty ad hoc
  • Filter by partition probably
  • add filtering by more fields: Nodes; GPU count and GPU type
  • Especially test filtering of duplicates
  • array job support
  • move all the adhoc-reports changes to #541
  • More printing / plotting options? probably at least to select job type (regular, array) - or is this a filter?
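Duplicate filtering matters because overlapping samples will report the same job more than once. One possible approach is sketched below, assuming records are dicts keyed by sacct field name and arrive in ingestion order, so that a later record for a job supersedes an earlier one. The function name and record shape are hypothetical, not the sonalyze implementation.

```python
def dedupe_latest(records):
    """Keep one record per JobID, preferring the most recently ingested one.

    Assumes `records` is in ingestion order. The result keeps each job at the
    position where it was first seen, but with the fields of its last record.
    """
    latest = {}
    for rec in records:
        latest[rec["JobID"]] = rec  # a later record overwrites an earlier one
    return list(latest.values())


samples = [
    {"JobID": "1001", "State": "RUNNING"},
    {"JobID": "1002", "State": "COMPLETED"},
    {"JobID": "1001", "State": "COMPLETED"},  # same job, seen again an hour later
]
unique = dedupe_latest(samples)
```

A test for this should at least cover a job that is RUNNING in one sample and COMPLETED in the next, since that is the case the overlap is meant to catch.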

Can maybe wait, if so move to a new bug:

  • Maybe a little more testing (but things seem pretty stable)
  • Standard filtering of eg dates?
  • Even more filtering
  • Basic regression tests would be good
  • het jobs (probably can wait?)
  • Verbose printing of filtering options a la jobs
  • There have been issues where use > reserved, is this a thing, a bug, or what?
  • Documentation / help?
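For the array job and het job items in the lists above, a job-type selector could be driven by the shape of the sacct JobID: array tasks are printed as `<jobid>_<taskid>`, het job components as `<jobid>+<offset>`, and job steps carry a `.<step>` suffix. The following is a rough sketch under that assumption; the function name and the returned labels are made up for illustration.

```python
def job_kind(job_id):
    """Classify a sacct JobID string as 'array', 'het', or 'regular'.

    Any step suffix (e.g. '.batch', '.0') is stripped first, so a step is
    classified the same way as its parent job.
    """
    base = job_id.split(".", 1)[0]
    if "_" in base:
        return "array"
    if "+" in base:
        return "het"
    return "regular"


kinds = [job_kind(j) for j in ["1234", "1234.batch", "1234_7", "1234_7.0", "1234+1"]]
```

Whether this ends up as a filter or a printing option, keeping the classification in one place would let both use the same rule.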

In-flight code for this is on larstha-549-sacct.

It's worth noting that there are jobs that will be reported by sacct that will not be seen by sonar because they run for too short a time to be sampled or don't rack up sufficient cpu time for sonar to want to report on them. So when we process sacct data we must always be open to the possibility that the jobs will be unknown to sonar.
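In practice this means any join between sacct records and sonar data has to be a left join from the sacct side. A minimal sketch follows, assuming sacct records are dicts and sonar data is available as a mapping from job ID to a list of samples; both shapes and all names here are hypothetical.

```python
def annotate_with_sonar(sacct_records, sonar_by_job):
    """Attach sonar samples to each sacct record, tolerating missing jobs.

    A job that sonar never sampled is kept, with `sonar_seen` set to False
    and an empty sample list, rather than being dropped or raising an error.
    """
    out = []
    for rec in sacct_records:
        samples = sonar_by_job.get(rec["JobID"])  # None when sonar never saw the job
        out.append({
            **rec,
            "sonar_seen": samples is not None,
            "sonar_samples": samples or [],
        })
    return out


sacct_records = [{"JobID": "1001"}, {"JobID": "1002"}]
sonar_by_job = {"1001": [{"cpu_pct": 95.0}]}  # job 1002 was too short to be sampled
joined = annotate_with_sonar(sacct_records, sonar_by_job)
```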

This will land when #66 lands. This is on the branch for #66 but it will land before #66 is finished, as that is a more general issue than this.

Now on larstha-549-sacct

AllocTRES has a comma-separated format, see AccountingStorageTRES in slurm.conf(5). Note this:
"NOTE: Setting gres/gpu will also set gres/gpumem and gres/gpuutil. gres/gpumem and gres/gpuutil can be set individually when gres/gpu is not set." But as of now there are no gpuutil or gpumem fields here; maybe they are coming.
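As an illustration of the comma-separated format, here is a small sketch that splits an AllocTRES string into name/value pairs. The sample string and function names are assumptions; values are kept as strings because some, such as mem, carry unit suffixes.

```python
def parse_alloc_tres(text):
    """Parse an AllocTRES string like 'cpu=4,mem=16G' into a dict of strings."""
    result = {}
    for item in text.split(","):
        if not item:
            continue  # an empty AllocTRES field yields an empty dict
        name, _, value = item.partition("=")
        result[name.strip()] = value.strip()
    return result


def gpu_count(tres):
    """Number of GPUs allocated, or 0 when no gres/gpu entry is present."""
    return int(tres.get("gres/gpu", "0"))


tres = parse_alloc_tres("billing=4,cpu=4,gres/gpu=1,mem=16G,node=1")
```

If gres/gpumem or gres/gpuutil entries do appear later, they would show up as additional keys without any change to the parser.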