NAICNO/Jobanalyzer

Job query fails to report a job with a single sample

Closed this issue · 6 comments

I have a job on fox (756722 on gpu-9, Aug 5) that has only a single sample, because the job is mostly idle and sampling only kicks in once it has been running for a while. The job query does not report this job. There can be good reasons for that (we can't compute record-to-record differences in CPU time with only one record), but at the same time it is annoying that the job just disappears. Filing this as a bug, but it probably needs a little more investigation.

$ ./sonalyze parse -data-dir ~/sonar/data/fox.educloud.no -j 756722 -from 2024-08-05 -to 2024-08-05 \
    -fmt all,csvnamed
version=0.11.0,localtime=2024-08-05 08:40,host=gpu-9.fox,cores=0,memtotal=0,user=ec-ewinge,pid=1249908,
job=756722,cmd=pt_main_thread,cpu_pct=100.5,mem_gb=3,res_gb=2,gpus=0,gpu_pct=0,gpumem_pct=0,
gpumem_gb=48,gpu_status=0,cputime_sec=279,rolledup=0,cpu_util_pct=0
$ ./sonalyze jobs -data-dir ~/sonar/data/fox.educloud.no -j 756722 -from 2024-08-05 -to 2024-08-05 -u -
jobm  user  duration  host  cpu-avg  cpu-peak  mem-avg  mem-peak  gpu-avg  gpu-peak  gpumem-avg  gpumem-peak  cmd
$ 
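To illustrate the point above about record-to-record differences, here is a minimal Python sketch. It is not sonalyze's actual computation: the function name `cpu_util_pct` and the (timestamp, cputime) tuple layout are assumptions made for illustration. It only shows why a delta-based CPU figure is undefined when a job has a single record.

```python
from typing import Optional

def cpu_util_pct(samples: list[tuple[float, float]]) -> Optional[float]:
    """Average CPU utilization in percent from cumulative CPU time.

    `samples` holds hypothetical (timestamp_sec, cputime_sec) pairs for one
    process, sorted by time. Returns None when no difference can be formed.
    """
    if len(samples) < 2:
        # A single record gives no interval to divide by.
        return None
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    return 100.0 * (c1 - c0) / (t1 - t0)

# One record, as for job 756722 (cputime_sec=279): no delta is available.
single = cpu_util_pct([(0.0, 279.0)])

# Two records; the second one is made up for the example.
pair = cpu_util_pct([(0.0, 279.0), (300.0, 579.0)])
```

With one record the result is `None`, which is presumably why the job carries no usable delta-based figure, but that alone does not have to mean the job is dropped from the listing.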

Filtering nukes it:

Streams constructed by postprocessing: 1
Samples retained after filtering: 1
Jobs constructed by merging: 1
Jobs discarded by aggregation filtering: 1
Jobs after aggregation filtering: 0
Number of jobs after output filtering: 0

Duh, this is because the default value for -min-samples is 2. That is a super obscure switch. It was originally implemented to filter out short-running jobs as "not interesting", and in reality some short-running jobs will not be shown anyway (those running for less than the sampling interval). But there is a difference between jobs that are evident in the sonar logs and jobs that never appear there, because not all sonalyze commands filter in the same way. It is confusing to see a record with e.g. sonalyze parse but not with sonalyze jobs.

If we're going to loosen this up, we should check whether the switch is used by other verbs too.

On branch larstha-550-verbose-filters. It is probably good to go for record filters and job filters, but I want to do the other verbs too.

It would appear that only sonalyze jobs uses -min-samples. Both sonalyze metadata and sonalyze parse support merging streams into multi-process, multi-node jobs, but they never discard resulting streams that have only one sample. It seems to me that the right fix here is to set the default -min-samples value to 1 (conservatively, even though 0 ought to mean the same thing) and then see whether there is any fallout.
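The effect of the threshold on merged streams can be sketched as follows. This is a simplified Python model under stated assumptions, not the sonalyze implementation: `Sample`, `merge_streams` and `filter_jobs` are hypothetical names, merging is reduced to grouping by job id, and the second job (id 1) is invented for contrast.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Sample:
    job: int
    host: str
    pid: int
    cputime_sec: int

def merge_streams(samples):
    """Group samples by job id; each value models one merged job."""
    jobs = defaultdict(list)
    for s in samples:
        jobs[s.job].append(s)
    return dict(jobs)

def filter_jobs(jobs, min_samples):
    """Keep only jobs whose merged stream has at least min_samples samples."""
    return {job: ss for job, ss in jobs.items() if len(ss) >= min_samples}

samples = [
    # The single record from the issue.
    Sample(756722, "gpu-9.fox", 1249908, 279),
    # A made-up job with two records, for contrast.
    Sample(1, "hypothetical-host", 100, 10),
    Sample(1, "hypothetical-host", 100, 20),
]
jobs = merge_streams(samples)

kept_with_2 = sorted(filter_jobs(jobs, 2))  # the single-sample job is dropped
kept_with_1 = sorted(filter_jobs(jobs, 1))  # the single-sample job is kept
kept_with_0 = sorted(filter_jobs(jobs, 0))  # same as 1 in this model
```

In this model a merged job always contains at least one sample, so thresholds of 0 and 1 select the same jobs. That matches the remark that 0 ought to mean the same thing as 1.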

Additionally, -min-samples could be added to metadata and parse to make things symmetric. And the help text everywhere should make it clear that the switch applies to merged streams (jobs).