[PM-1998] pegasus-statistics lists incomplete jobs in a successful aws batch wf run
Closed this issue · 4 comments
When a workflow is run on AWS Batch, pegasus-statistics reports incomplete jobs in the job breakdown even though the run succeeded.
For example, for the diamond workflow run in the 046-aws-batch-black workflow test (note the Incomplete count in the Tasks row):
------------------------------------------------------------------------------
Type           Succeeded  Failed  Incomplete  Total  Retries  Total+Retries
Tasks          3          0       1           4      0        3
Jobs           18         0       0           18     0        18
Sub-Workflows  0          0       0           0      0        0
------------------------------------------------------------------------------
Workflow wall time : 15 mins, 58 secs
Cumulative job wall time : 8 mins, 47 secs
Cumulative job wall time as seen from submit side : 15 mins, 2 secs
Cumulative job badput wall time : 0.0 secs
Cumulative job badput wall time as seen from submit side : 0.0 secs
Author: @vahi
On further investigation, it seems that monitord was not parsing the clustered job output. In the case of AWS Batch, all jobs in the workflow get clustered and are then run via pegasus-aws-batch. The cause appears to be the records that pegasus-aws-batch puts in between the invocation records in the .out files:
[bamboo@bamboo 046-aws-batch-black]$ grep "\[ id" dags/bamboo/pegasus/diamond/run0002/00/00/merge_diamond-findrange-4_0_PID2_ID1.out.000
[ id=2, name="findrange_ID0000003", aws-job-id="c4356fb9-a5e6-4ff7-8ebf-dad40b6555f9", state=succeeded, status=0, id=2, app="pegasus-aws-batch-launch.sh"]
[ id=1, name="findrange_ID0000002", aws-job-id="8bb8e978-f65c-4557-a24b-d91b9edd7900", state=succeeded, status=0, id=1, app="pegasus-aws-batch-launch.sh"]
whereas for a job that is executed using pegasus-cluster the records are as follows
[bamboo@bamboo 044-singularity-nonsharedfs-minimal]$ grep "\[cluster-task" dags/bamboo/pegasus/diamond/run0002/00/00/merge_diamond-findrange-4_0_PID2_ID1.out.000
[cluster-task id=1, start="2024-04-16T11:06:15.419-07:00", duration=60.087, status=0, line=2, pid=4186000, app="./diamond-findrange-4.0"]
[cluster-task id=2, start="2024-04-16T11:07:15.507-07:00", duration=60.076, status=0, line=4, pid=4186178, app="./diamond-findrange-4.0"]
the pegasus-aws-batch records are missing the cluster-task attribute/prefix
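To illustrate why a missing prefix matters, here is a minimal sketch of a record parser keyed on the `cluster-task` prefix. It is not monitord's actual code; the function name `parse_cluster_task` and both regular expressions are assumptions made for illustration. Only the two sample records are taken from the output above.

```python
import re

# Hypothetical pattern: a bracketed record whose first token is "cluster-task".
CLUSTER_TASK_RE = re.compile(r"^\[cluster-task\s+(?P<attrs>.*)\]\s*$")
# key=value pairs; a value is either double-quoted or a bare token.
ATTR_RE = re.compile(r'([\w-]+)=("[^"]*"|[^,\s\]]+)')


def parse_cluster_task(line):
    """Return the record's attributes as a dict, or None if the line
    is not a cluster-task record."""
    match = CLUSTER_TASK_RE.match(line.strip())
    if match is None:
        return None
    return {key: value.strip('"')
            for key, value in ATTR_RE.findall(match.group("attrs"))}


# Record as written by pegasus-aws-batch (no cluster-task prefix).
aws_record = ('[ id=2, name="findrange_ID0000003", '
              'aws-job-id="c4356fb9-a5e6-4ff7-8ebf-dad40b6555f9", '
              'state=succeeded, status=0, id=2, '
              'app="pegasus-aws-batch-launch.sh"]')
# Record as written by pegasus-cluster.
cluster_record = ('[cluster-task id=1, start="2024-04-16T11:06:15.419-07:00", '
                  'duration=60.087, status=0, line=2, pid=4186000, '
                  'app="./diamond-findrange-4.0"]')

print(parse_cluster_task(aws_record))      # not recognized as a task record
print(parse_cluster_task(cluster_record))  # attributes of the task record
```

A parser built this way silently skips the pegasus-aws-batch records, which would be consistent with tasks being counted as incomplete.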
Author: @vahi
After updating the job .out file to include the cluster-task records, pegasus-exitcode now fails because of a missing cluster-summary record:
more diamond-0.exitcode.log
{"name": "./00/00/merge_diamond-preprocess-4_0_PID1_ID1.out", "timestamp": "2024-11-19T09:02:10.179135", "exitcode": 1, "app_exitcode": 0, "retry": 0, "std_out": "", "std_err": "cluster-summary is missing\n"}
{"name": "./00/00/merge_diamond-preprocess-4_0_PID1_ID1.out", "timestamp": "2024-11-19T09:07:01.276999", "exitcode": 1, "app_exitcode": 0, "retry": 1, "std_out": "", "std_err": "cluster-summary is missing\n"}
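For scanning a log like this, a small sketch can help. It assumes the log holds one JSON object per line, as in the excerpt above; the helper name `failed_entries` is hypothetical and not part of Pegasus.

```python
import json


def failed_entries(log_lines):
    """Yield (name, retry, std_err) for every entry with a non-zero exitcode.

    Assumption: each record is one complete JSON object on its own line.
    """
    for line in log_lines:
        try:
            entry = json.loads(line.strip())
        except json.JSONDecodeError:
            continue  # skip anything that is not a complete JSON object
        if isinstance(entry, dict) and entry.get("exitcode", 0) != 0:
            yield (entry.get("name"),
                   entry.get("retry"),
                   entry.get("std_err", "").strip())


# The two records from the excerpt above (raw strings keep the JSON \n escape).
sample_log = [
    r'{"name": "./00/00/merge_diamond-preprocess-4_0_PID1_ID1.out", "timestamp": "2024-11-19T09:02:10.179135", "exitcode": 1, "app_exitcode": 0, "retry": 0, "std_out": "", "std_err": "cluster-summary is missing\n"}',
    r'{"name": "./00/00/merge_diamond-preprocess-4_0_PID1_ID1.out", "timestamp": "2024-11-19T09:07:01.276999", "exitcode": 1, "app_exitcode": 0, "retry": 1, "std_out": "", "std_err": "cluster-summary is missing\n"}',
]

for name, retry, err in failed_entries(sample_log):
    print(f"{name} (retry {retry}): {err}")
```

Both attempts fail with the same message while `app_exitcode` is 0, so the failure comes from the output check rather than from the application itself.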
a sample record from pegasus-cluster is
[cluster-summary stat="ok", lines=4, tasks=2, succeeded=2, failed=0, extra=0, duration=120.165, start="2024-04-16T11:06:15.419-07:00", pid=4185999, app="pegasus-cluster"]
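The check that fails here can be sketched roughly as follows. This is not the pegasus-exitcode implementation: the names `find_cluster_summary` and `check_output` are hypothetical, and treating `stat="ok"` with `failed=0` as success is an assumption based only on the sample record above.

```python
import re

# Hypothetical patterns, modeled on the sample cluster-summary record.
SUMMARY_RE = re.compile(r"^\[cluster-summary\s+(?P<attrs>.*)\]\s*$")
ATTR_RE = re.compile(r'([\w-]+)=("[^"]*"|[^,\s\]]+)')


def find_cluster_summary(lines):
    """Return the attributes of the first cluster-summary record, or None."""
    for line in lines:
        match = SUMMARY_RE.match(line.strip())
        if match:
            return {key: value.strip('"')
                    for key, value in ATTR_RE.findall(match.group("attrs"))}
    return None


def check_output(lines):
    """Return (exitcode, message) for the lines of a clustered job's .out file.

    Assumption: success means stat="ok" and failed=0 in the summary.
    """
    summary = find_cluster_summary(lines)
    if summary is None:
        return 1, "cluster-summary is missing"
    if summary.get("stat") != "ok" or int(summary.get("failed", "0")) > 0:
        return 1, "cluster-summary reports failures"
    return 0, ""


task_line = ('[cluster-task id=1, start="2024-04-16T11:06:15.419-07:00", '
             'duration=60.087, status=0, line=2, pid=4186000, '
             'app="./diamond-findrange-4.0"]')
summary_line = ('[cluster-summary stat="ok", lines=4, tasks=2, succeeded=2, '
                'failed=0, extra=0, duration=120.165, '
                'start="2024-04-16T11:06:15.419-07:00", pid=4185999, '
                'app="pegasus-cluster"]')

print(check_output([task_line]))                # task records only
print(check_output([task_line, summary_line]))  # task records plus summary
```

Under these assumptions, output with task records but no summary is rejected, which matches the "cluster-summary is missing" error in the exitcode log.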