broadinstitute/cromshell

CRITICAL: job status reported by "cromshell list -u" is incorrect and never updates.

dalessioluca opened this issue · 6 comments

cromshell list -u is supposed to check the completion status of all unfinished jobs.
However, it sometimes reports incorrect values while cromshell status reports the correct ones.
Even after running cromshell status with a specific job ID, cromshell list -u keeps listing the old, incorrect status.

The implication is that the status reported by cromshell list -u is unreliable.
Jobs could keep running silently while the user believes they were terminated, which is why I consider this a critical bug.

I have not figured out how to reproduce the problem.
However, here are 8 examples of jobs that are listed as running but have in fact terminated.

[Image: Screen Shot 2021-11-29 at 9 32 18 AM]

Huh... I wonder why that's happening.

Probably has to do with how the TSV gets updated when you query / update it.

Somewhere in the status function the ~/.cromshell/<TSV> file is updated. That's almost certainly where the problem lies.
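To make the suspected mechanism concrete, here is a minimal sketch of a per-row status update on a tab-separated file. This is not cromshell's actual code: the function name update_status and the column layout (run ID in column 3, status in column 5) are assumptions for illustration only.

```shell
#!/bin/bash
# Hypothetical helper, NOT taken from cromshell.
# Assumed layout: tab-separated, run ID in column 3, status in column 5.
update_status() {
    local tsv="$1" run_id="$2" new_status="$3"
    local tmp
    tmp=$(mktemp) || return 1
    # Rewrite only the row whose ID matches exactly; copy every other row unchanged.
    awk -F'\t' -v OFS='\t' -v id="$run_id" -v st="$new_status" \
        '$3 == id { $5 = st } { print }' "$tsv" > "$tmp" && mv "$tmp" "$tsv"
}
```

If an update of this kind matched the wrong rows, or wrote to a different file than the one list -u reads, the listed status would stay stale, which would be consistent with what is reported here.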

Priority of list -u in cromshell 2.0 bumped. @bshifaw

I have just noticed that the jobs with the wrong status are present in 3 TSV files. Could that be part of the problem?

[Image: Screen Shot 2021-11-03 at 3 12 49 PM]
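For anyone who wants to check whether their own jobs show up in more than one database file, a small sketch along these lines may help. The function name files_containing_id is made up for illustration; the file name pattern all.workflow.database.tsv* is the one used elsewhere in this thread.

```shell
#!/bin/bash
# Hypothetical helper: list every database file in a directory that mentions a run ID.
files_containing_id() {
    local dir="$1" run_id="$2"
    # -l prints only the names of matching files;
    # -F treats the ID as a fixed string rather than a regular expression.
    grep -l -F -- "$run_id" "$dir"/all.workflow.database.tsv* 2>/dev/null
}
```

Calling it as files_containing_id ~/.cromshell followed by a run ID prints one file name per line; more than one line means the ID is recorded in several files.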

You can place this script in your .cromshell directory to check the status of your jobs. It simply runs cromshell status in a loop.

    #!/bin/bash
    # Collect the unique job IDs from the most current database file only.
    awk '{print $(NF-2)}' all.workflow.database.tsv | sort -u > id_to_check.txt
    # To check the IDs from all database files instead:
    # cat all.workflow.database.tsv* | awk '{print $(NF-2)}' | sort -u > id_to_check.txt

    # Start from a clean results file.
    rm -f status.txt

    for job_id in $(cat id_to_check.txt)
    do
        # Skip the header line, which yields WDL_NAME in this column.
        if [ "$job_id" != 'WDL_NAME' ]; then
            # Ask the server for the live status instead of trusting the TSV.
            status=$(cromshell status "$job_id" | grep "status")
            echo "$job_id" $status >> status.txt
        fi
    done

    echo "The following jobs are running:"
    grep "unning" status.txt

The multiple files shouldn't be an issue - it should only be looking in all.workflow.database.tsv.

I'll take a look at this very soon.