pegasus-isi/pegasus

[PM-1998] pegasus-statistics lists incomplete jobs in a successful aws batch wf run

Closed this issue · 4 comments

When a workflow is run using AWS Batch, pegasus-statistics reports incomplete jobs in the job breakdown.

 

For a diamond workflow run as part of the 046-aws-batch-black workflow test:

------------------------------------------------------------------------------
Type           Succeeded Failed  Incomplete  Total     Retries   Total+Retries
Tasks          3         0       1           4         0         3            
Jobs           18        0       0           18        0         18           
Sub-Workflows  0         0       0           0         0         0            
------------------------------------------------------------------------------


Workflow wall time                                       : 15 mins, 58 secs
Cumulative job wall time                                 : 8 mins, 47 secs
Cumulative job wall time as seen from submit side        : 15 mins, 2 secs
Cumulative job badput wall time                          : 0.0 secs
Cumulative job badput wall time as seen from submit side : 0.0 secs 

Reporter: @vahi
Resolution: Fixed
Watchers:
@vahi

Author: @vahi

On further investigation, it seems that monitord was not parsing the clustered job output. In the AWS Batch case, all jobs in the workflow get clustered and are then run via pegasus-aws-batch. The cause appears to be the format of the records that pegasus-aws-batch writes in between the invocation records in the .out files:

 

[bamboo@bamboo 046-aws-batch-black]$ grep "\[ id" dags/bamboo/pegasus/diamond/run0002/00/00/merge_diamond-findrange-4_0_PID2_ID1.out.000 
[ id=2, name="findrange_ID0000003", aws-job-id="c4356fb9-a5e6-4ff7-8ebf-dad40b6555f9", state=succeeded, status=0, id=2, app="pegasus-aws-batch-launch.sh"]
[ id=1, name="findrange_ID0000002", aws-job-id="8bb8e978-f65c-4557-a24b-d91b9edd7900", state=succeeded, status=0, id=1, app="pegasus-aws-batch-launch.sh"]
 

whereas for a job that is executed using pegasus-cluster, the records are as follows:

[bamboo@bamboo 044-singularity-nonsharedfs-minimal]$ grep "\[cluster-task" dags/bamboo/pegasus/diamond/run0002/00/00/merge_diamond-findrange-4_0_PID2_ID1.out.000 
[cluster-task id=1, start="2024-04-16T11:06:15.419-07:00", duration=60.087, status=0, line=2, pid=4186000, app="./diamond-findrange-4.0"]
[cluster-task id=2, start="2024-04-16T11:07:15.507-07:00", duration=60.076, status=0, line=4, pid=4186178, app="./diamond-findrange-4.0"]
 

The pegasus-aws-batch records are missing the cluster-task prefix/attribute.
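
For illustration, here is a minimal Python sketch (not monitord's actual parsing code; the record text is abridged from the grep output above) of how a parser keyed on the cluster-task prefix matches the pegasus-cluster records but skips the pegasus-aws-batch ones:

import re

# A parser that keys on the "[cluster-task" prefix, as in the
# pegasus-cluster records above, will not match the bracketed
# records that pegasus-aws-batch writes.
CLUSTER_TASK_RE = re.compile(r'^\[cluster-task\s+id=(\d+),')

records = [
    # pegasus-aws-batch style (no cluster-task prefix)
    '[ id=2, name="findrange_ID0000003", state=succeeded, status=0, app="pegasus-aws-batch-launch.sh"]',
    # pegasus-cluster style
    '[cluster-task id=1, start="2024-04-16T11:06:15.419-07:00", duration=60.087, status=0, line=2, pid=4186000, app="./diamond-findrange-4.0"]',
]

for record in records:
    match = CLUSTER_TASK_RE.match(record)
    print("parsed" if match else "skipped", "->", record[:40])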

Author: @vahi

After updating the job .out file to include the cluster-task records, pegasus-exitcode now fails because of a missing cluster-summary record:

 more diamond-0.exitcode.log 
{"name": "./00/00/merge_diamond-preprocess-4_0_PID1_ID1.out", "timestamp": "2024-11-19T09:02:10.179135", "exitcode": 1, "app_exitcode": 0, "retry": 0, "std_out": "", "std_err": "cluster-summary is missing\n"}
{"name": "./00/00/merge_diamond-preprocess-4_0_PID1_ID1.out", "timestamp": "2024-11-19T09:07:01.276999", "exitcode": 1, "app_exitcode": 0, "retry": 1, "std_out": "", "std_err": "cluster-summary is missing\n"}
 

A sample cluster-summary record from pegasus-cluster is:

[cluster-summary stat="ok", lines=4, tasks=2, succeeded=2, failed=0, extra=0, duration=120.165, start="2024-04-16T11:06:15.419-07:00", pid=4185999, app="pegasus-cluster"] 
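
As a rough sketch of the kind of check that produces the failure above (this is not the actual pegasus-exitcode implementation), a clustered job .out file that contains cluster-task records but no cluster-summary record is treated as a failure:

# Sketch only; pegasus-exitcode's real logic is more involved.
def check_cluster_summary(out_file_path: str) -> int:
    with open(out_file_path) as fh:
        text = fh.read()
    has_tasks = "[cluster-task" in text
    has_summary = "[cluster-summary" in text
    if has_tasks and not has_summary:
        # Matches the std_err seen in diamond-0.exitcode.log above.
        print("cluster-summary is missing")
        return 1
    return 0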

Author: @vahi

In the job log file (not the job stdout), pegasus-aws-batch does write out a cluster-summary record:

 

[bamboo@bamboo run0003]$ grep cluster-summary 00/00/merge_diamond-preprocess-4_0_PID1_ID1.log.000 
2024-11-19 09:02:07.819 INFO  [Synch] [cluster-summary tasks=1, succeeded=1, failed=0 ]
 

Author: @vahi

pegasus-aws-batch will now log a cluster-summary record similar to what pegasus-cluster does:

 

[cluster-summary tasks=1, succeeded=1, failed=0, duration=364.542, start="2024-11-19T15:29:44-08:00", app="pegasus-aws-batch"]
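
For reference, a small Python sketch that composes a record in this format (the field names follow the sample above; the helper function and values are illustrative, not part of pegasus-aws-batch):

from datetime import datetime, timedelta, timezone

def cluster_summary_record(tasks, succeeded, failed, duration, start, app="pegasus-aws-batch"):
    # Assembles a bracketed key=value record in the style shown above.
    return (
        f'[cluster-summary tasks={tasks}, succeeded={succeeded}, failed={failed}, '
        f'duration={duration:.3f}, start="{start}", app="{app}"]'
    )

start = datetime(2024, 11, 19, 15, 29, 44, tzinfo=timezone(timedelta(hours=-8)))
print(cluster_summary_record(tasks=1, succeeded=1, failed=0, duration=364.542, start=start.isoformat()))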