Job query fails to report a job with a single sample
Closed this issue · 6 comments
I have a job on fox (756722 on gpu-9 on Aug 5) that had only a single sample b/c the job is mostly idle and sampling only kicks in once it has been running for a while. This job is not reported by the job query, and there can be good reasons for that (we can't compute record-to-record differences in CPU time with only one record), but at the same time it is annoying that this job just disappears. Filing this as a bug, but we probably want to investigate a little more.
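To illustrate the "only one record" point, here is a minimal sketch of a record-to-record CPU computation. This is not sonalyze's actual code: the function name and the tuple layout are made up, and the second sample is hypothetical (only the 08:40 timestamp and the 279 s of CPU time come from the record in this issue).

```python
from datetime import datetime


def cpu_util_between_records(samples):
    """Return CPU utilization (percent) for each pair of consecutive samples.

    `samples` is a time-sorted list of (timestamp, cputime_sec) tuples.
    This layout is an assumption for the sketch, not sonalyze's record format.
    """
    rates = []
    # zip(samples, samples[1:]) yields no pairs when there is only one sample.
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = (t1 - t0).total_seconds()
        if elapsed > 0:
            rates.append(100.0 * (c1 - c0) / elapsed)
    return rates


# The single record seen for job 756722.
one = [(datetime(2024, 8, 5, 8, 40), 279)]
# Hypothetical second record, five minutes later, for comparison only.
two = one + [(datetime(2024, 8, 5, 8, 45), 579)]

print(cpu_util_between_records(one))  # []
print(cpu_util_between_records(two))  # [100.0]
```

With a single record there is nothing to difference, so a derived value like this has no data, but that by itself does not force the job to be hidden.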
$ ./sonalyze parse -data-dir ~/sonar/data/fox.educloud.no -j 756722 -from 2024-08-05 -to 2024-08-05 \
-fmt all,csvnamed
version=0.11.0,localtime=2024-08-05 08:40,host=gpu-9.fox,cores=0,memtotal=0,user=ec-ewinge,pid=1249908,
job=756722,cmd=pt_main_thread,cpu_pct=100.5,mem_gb=3,res_gb=2,gpus=0,gpu_pct=0,gpumem_pct=0,
gpumem_gb=48,gpu_status=0,cputime_sec=279,rolledup=0,cpu_util_pct=0
$ ./sonalyze jobs -data-dir ~/sonar/data/fox.educloud.no -j 756722 -from 2024-08-05 -to 2024-08-05 -u -
jobm user duration host cpu-avg cpu-peak mem-avg mem-peak gpu-avg gpu-peak gpumem-avg gpumem-peak cmd
$
Filtering nukes it:
Streams constructed by postprocessing: 1
Samples retained after filtering: 1
Jobs constructed by merging: 1
Jobs discarded by aggregation filtering: 1
Jobs after aggregation filtering: 0
Number of jobs after output filtering: 0
Duh, this is because the default value for -min-samples is 2. That is a super obscure switch. Initially that was implemented to filter short-running jobs b/c "not interesting", and the reality is that some short-running jobs will not be shown anyway (those running for less than the sampling interval). But there's a difference between jobs that are evident in the sonar logs and those that aren't there, because not all sonalyze commands filter in the same way. It is confusing to see a record using e.g. sonalyze parse but not to see it using sonalyze jobs.
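A rough sketch of what the aggregation filter appears to be doing, to make the counts in the verbose output above concrete. This is an illustration only: the function name, the dict layout and the `>=` comparison are assumptions, not taken from the sonalyze source.

```python
def aggregation_filter(jobs, min_samples=2):
    """Keep jobs with at least `min_samples` samples.

    `jobs` maps a job id to its list of samples. Returns (kept, discarded_count).
    The default of 2 mirrors the -min-samples default discussed in this issue;
    the >= semantics are an assumption.
    """
    kept = {job_id: s for job_id, s in jobs.items() if len(s) >= min_samples}
    return kept, len(jobs) - len(kept)


# One job with one sample, as in the report.
jobs = {756722: [{"host": "gpu-9.fox", "cputime_sec": 279}]}

kept, discarded = aggregation_filter(jobs)
print(f"Jobs constructed by merging: {len(jobs)}")
print(f"Jobs discarded by aggregation filtering: {discarded}")
print(f"Jobs after aggregation filtering: {len(kept)}")
```

Under these assumptions a one-sample job is built and then thrown away by the default threshold, which matches the 1 / 1 / 0 sequence reported above.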
If we're going to loosen this up, we should be sure the switch is not used for other verbs too.
On branch larstha-550-verbose-filters, probably good to go for record filters and job filters, but want to do the other verbs too.
It would appear that only sonalyze jobs uses -min-samples. Both sonalyze metadata and sonalyze parse support merging streams into multi-process multi-node jobs, but will never discard any resulting streams that have only one sample. It seems to me that the right fix here is to set the -min-samples value to 1 (conservatively... even though 0 ought to mean the same thing...) and reconsider if there is any fallout.
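On the "0 ought to mean the same thing" remark: a job only exists if at least one sample was merged into it, so thresholds 0 and 1 should select the same set. A tiny sketch under that assumption (the helper name and the sample counts are hypothetical, and `>=` is assumed as the comparison):

```python
def jobs_retained(sample_counts, min_samples):
    """Return the per-job sample counts that pass a minimum-samples threshold."""
    return [n for n in sample_counts if n >= min_samples]


# Hypothetical per-job sample counts; every constructed job has at least one.
counts = [1, 2, 5]

for threshold in (0, 1, 2):
    print(threshold, jobs_retained(counts, threshold))
```

Only the threshold of 2 drops the single-sample job here; 0 and 1 keep all three.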
Additionally -min-samples could be added to metadata and parse, to make things symmetric. And the help text everywhere should make it clear that it applies to merged streams (jobs).