ASFHyP3/hyp3

Hawaii jobs for INSAR_ISCE_TEST are failing

jhkennedy opened this issue · 8 comments

These jobs are failing with what look like memory or disk errors, and JPL would like us to investigate.

Previously, Hawaii jobs failed with disk space issues, so we restricted jobs to c6id.xlarge (dropping c5d.xlarge) to ensure more disk was available.

[
  {
    "job_id": "fa5bea5e-cfb4-407c-b2a5-44f74b7bb0b1",
    "job_type": "INSAR_ISCE_TEST",
    "request_time": "2023-02-28T18:16:36+00:00",
    "status_code": "FAILED",
    "user_id": "cmarshak",
    "name": "Hawaii_124_beta_GMAO",
    "job_parameters": {
      "estimate_ionosphere_delay": true,
      "frame_id": 19224,
      "granules": [
        "S1A_IW_SLC__1SDV_20230205T043052_20230205T043122_047096_05A65E_2987",
        "S1A_IW_SLC__1SDV_20230205T043120_20230205T043148_047096_05A65E_5D6F"
      ],
      "secondary_granules": [
        "S1A_IW_SLC__1SDV_20230112T043053_20230112T043123_046746_059A9B_C858",
        "S1A_IW_SLC__1SDV_20230112T043120_20230112T043148_046746_059A9B_7AB8"
      ],
      "weather_model": "GMAO"
    },
    "logs": [
      "https://hyp3-a19-jpl-contentbucket-1wfnatpznlg8b.s3.us-west-2.amazonaws.com/fa5bea5e-cfb4-407c-b2a5-44f74b7bb0b1/fa5bea5e-cfb4-407c-b2a5-44f74b7bb0b1.log"
    ],
    "expiration_time": "2023-08-28T00:00:00+00:00",
    "processing_times": [
      2473.484
    ]
  },
  {
    "job_id": "2278d437-c706-42cd-8a2b-bdbc13e77909",
    "job_type": "INSAR_ISCE_TEST",
    "request_time": "2023-02-28T18:16:36+00:00",
    "status_code": "FAILED",
    "user_id": "cmarshak",
    "name": "Hawaii_124_beta_GMAO",
    "job_parameters": {
      "estimate_ionosphere_delay": true,
      "frame_id": 19223,
      "granules": [
        "S1A_IW_SLC__1SDV_20230217T043052_20230217T043122_047271_05AC44_293F"
      ],
      "secondary_granules": [
        "S1A_IW_SLC__1SDV_20230124T043053_20230124T043123_046921_05A089_864A"
      ],
      "weather_model": "GMAO"
    },
    "logs": [
      "https://hyp3-a19-jpl-contentbucket-1wfnatpznlg8b.s3.us-west-2.amazonaws.com/2278d437-c706-42cd-8a2b-bdbc13e77909/2278d437-c706-42cd-8a2b-bdbc13e77909.log"
    ],
    "expiration_time": "2023-08-28T00:00:00+00:00",
    "processing_times": [
      2527.366
    ]
  },
  {
    "job_id": "bb26de21-14bd-4332-a11b-c3a00417b274",
    "job_type": "INSAR_ISCE_TEST",
    "request_time": "2023-02-28T18:16:36+00:00",
    "status_code": "FAILED",
    "user_id": "cmarshak",
    "name": "Hawaii_124_beta_GMAO",
    "job_parameters": {
      "estimate_ionosphere_delay": true,
      "frame_id": 19223,
      "granules": [
        "S1A_IW_SLC__1SDV_20230217T043052_20230217T043122_047271_05AC44_293F"
      ],
      "secondary_granules": [
        "S1A_IW_SLC__1SDV_20230205T043052_20230205T043122_047096_05A65E_2987"
      ],
      "weather_model": "GMAO"
    },
    "logs": [
      "https://hyp3-a19-jpl-contentbucket-1wfnatpznlg8b.s3.us-west-2.amazonaws.com/bb26de21-14bd-4332-a11b-c3a00417b274/bb26de21-14bd-4332-a11b-c3a00417b274.log"
    ],
    "expiration_time": "2023-08-28T00:00:00+00:00",
    "processing_times": [
      2522.194
    ]
  },
  {
    "job_id": "4af0661f-b73c-4e41-97e2-602c6ac2190b",
    "job_type": "INSAR_ISCE_TEST",
    "request_time": "2023-02-28T18:16:36+00:00",
    "status_code": "FAILED",
    "user_id": "cmarshak",
    "name": "Hawaii_124_beta_GMAO",
    "job_parameters": {
      "estimate_ionosphere_delay": true,
      "frame_id": 19224,
      "granules": [
        "S1A_IW_SLC__1SDV_20230217T043052_20230217T043122_047271_05AC44_293F",
        "S1A_IW_SLC__1SDV_20230217T043119_20230217T043147_047271_05AC44_56B5"
      ],
      "secondary_granules": [
        "S1A_IW_SLC__1SDV_20230205T043052_20230205T043122_047096_05A65E_2987",
        "S1A_IW_SLC__1SDV_20230205T043120_20230205T043148_047096_05A65E_5D6F"
      ],
      "weather_model": "GMAO"
    },
    "logs": [
      "https://hyp3-a19-jpl-contentbucket-1wfnatpznlg8b.s3.us-west-2.amazonaws.com/4af0661f-b73c-4e41-97e2-602c6ac2190b/4af0661f-b73c-4e41-97e2-602c6ac2190b.log"
    ],
    "expiration_time": "2023-08-28T00:00:00+00:00",
    "processing_times": [
      2522.408
    ]
  },
  {
    "job_id": "7f9e744d-f1ba-40cd-af4a-104536ce2a42",
    "job_type": "INSAR_ISCE_TEST",
    "request_time": "2023-02-28T18:16:36+00:00",
    "status_code": "FAILED",
    "user_id": "cmarshak",
    "name": "Hawaii_124_beta_GMAO",
    "job_parameters": {
      "estimate_ionosphere_delay": true,
      "frame_id": 19224,
      "granules": [
        "S1A_IW_SLC__1SDV_20230124T043053_20230124T043123_046921_05A089_864A",
        "S1A_IW_SLC__1SDV_20230124T043121_20230124T043148_046921_05A089_74AF"
      ],
      "secondary_granules": [
        "S1A_IW_SLC__1SDV_20230112T043053_20230112T043123_046746_059A9B_C858",
        "S1A_IW_SLC__1SDV_20230112T043120_20230112T043148_046746_059A9B_7AB8"
      ],
      "weather_model": "GMAO"
    },
    "logs": [
      "https://hyp3-a19-jpl-contentbucket-1wfnatpznlg8b.s3.us-west-2.amazonaws.com/7f9e744d-f1ba-40cd-af4a-104536ce2a42/7f9e744d-f1ba-40cd-af4a-104536ce2a42.log"
    ],
    "expiration_time": "2023-08-28T00:00:00+00:00",
    "processing_times": [
      491.125
    ]
  },
  {
    "job_id": "5c6d99a0-43c6-4a05-9a4a-d22e65ddf7e0",
    "job_type": "INSAR_ISCE_TEST",
    "request_time": "2023-02-28T18:16:36+00:00",
    "status_code": "FAILED",
    "user_id": "cmarshak",
    "name": "Hawaii_124_beta_GMAO",
    "job_parameters": {
      "estimate_ionosphere_delay": true,
      "frame_id": 19223,
      "granules": [
        "S1A_IW_SLC__1SDV_20230205T043052_20230205T043122_047096_05A65E_2987"
      ],
      "secondary_granules": [
        "S1A_IW_SLC__1SDV_20230112T043053_20230112T043123_046746_059A9B_C858"
      ],
      "weather_model": "GMAO"
    },
    "logs": [
      "https://hyp3-a19-jpl-contentbucket-1wfnatpznlg8b.s3.us-west-2.amazonaws.com/5c6d99a0-43c6-4a05-9a4a-d22e65ddf7e0/5c6d99a0-43c6-4a05-9a4a-d22e65ddf7e0.log"
    ],
    "expiration_time": "2023-08-28T00:00:00+00:00",
    "processing_times": [
      2522.276
    ]
  },
  {
    "job_id": "bebc172b-710a-4407-8bb9-89f8b000a7ad",
    "job_type": "INSAR_ISCE_TEST",
    "request_time": "2023-02-28T18:16:36+00:00",
    "status_code": "FAILED",
    "user_id": "cmarshak",
    "name": "Hawaii_124_beta_GMAO",
    "job_parameters": {
      "estimate_ionosphere_delay": true,
      "frame_id": 19223,
      "granules": [
        "S1A_IW_SLC__1SDV_20230124T043053_20230124T043123_046921_05A089_864A"
      ],
      "secondary_granules": [
        "S1A_IW_SLC__1SDV_20230112T043053_20230112T043123_046746_059A9B_C858"
      ],
      "weather_model": "GMAO"
    },
    "logs": [
      "https://hyp3-a19-jpl-contentbucket-1wfnatpznlg8b.s3.us-west-2.amazonaws.com/bebc172b-710a-4407-8bb9-89f8b000a7ad/bebc172b-710a-4407-8bb9-89f8b000a7ad.log"
    ],
    "expiration_time": "2023-08-28T00:00:00+00:00",
    "processing_times": [
      2522.679
    ]
  },
  {
    "job_id": "58a5889b-bc79-428d-b523-3ad847b434e5",
    "job_type": "INSAR_ISCE_TEST",
    "request_time": "2023-02-28T18:16:36+00:00",
    "status_code": "FAILED",
    "user_id": "cmarshak",
    "name": "Hawaii_124_beta_GMAO",
    "job_parameters": {
      "estimate_ionosphere_delay": true,
      "frame_id": 19223,
      "granules": [
        "S1A_IW_SLC__1SDV_20230205T043052_20230205T043122_047096_05A65E_2987"
      ],
      "secondary_granules": [
        "S1A_IW_SLC__1SDV_20230124T043053_20230124T043123_046921_05A089_864A"
      ],
      "weather_model": "GMAO"
    },
    "logs": [
      "https://hyp3-a19-jpl-contentbucket-1wfnatpznlg8b.s3.us-west-2.amazonaws.com/58a5889b-bc79-428d-b523-3ad847b434e5/58a5889b-bc79-428d-b523-3ad847b434e5.log"
    ],
    "expiration_time": "2023-08-28T00:00:00+00:00",
    "processing_times": [
      2528.036
    ]
  },
  {
    "job_id": "33b345e8-dbcf-47e9-8e56-c5c7d4ad65bb",
    "job_type": "INSAR_ISCE_TEST",
    "request_time": "2023-02-28T18:16:36+00:00",
    "status_code": "FAILED",
    "user_id": "cmarshak",
    "name": "Hawaii_124_beta_GMAO",
    "job_parameters": {
      "estimate_ionosphere_delay": true,
      "frame_id": 19224,
      "granules": [
        "S1A_IW_SLC__1SDV_20230205T043052_20230205T043122_047096_05A65E_2987",
        "S1A_IW_SLC__1SDV_20230205T043120_20230205T043148_047096_05A65E_5D6F"
      ],
      "secondary_granules": [
        "S1A_IW_SLC__1SDV_20230124T043053_20230124T043123_046921_05A089_864A",
        "S1A_IW_SLC__1SDV_20230124T043121_20230124T043148_046921_05A089_74AF"
      ],
      "weather_model": "GMAO"
    },
    "logs": [
      "https://hyp3-a19-jpl-contentbucket-1wfnatpznlg8b.s3.us-west-2.amazonaws.com/33b345e8-dbcf-47e9-8e56-c5c7d4ad65bb/33b345e8-dbcf-47e9-8e56-c5c7d4ad65bb.log"
    ],
    "expiration_time": "2023-08-28T00:00:00+00:00",
    "processing_times": [
      2527.33
    ]
  }
]
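
A listing like the one above is easier to compare when grouped by frame. The following is a minimal sketch, not part of HyP3 itself: `summarize_jobs` and `JOBS_JSON` are hypothetical names, and the embedded data is a trimmed copy of two entries from the listing with only the fields used here.

```python
import json
from collections import Counter

# Trimmed copy of two entries from the listing above (only the fields used here).
JOBS_JSON = """
[
  {"job_id": "fa5bea5e-cfb4-407c-b2a5-44f74b7bb0b1", "status_code": "FAILED",
   "job_parameters": {"frame_id": 19224}, "processing_times": [2473.484]},
  {"job_id": "7f9e744d-f1ba-40cd-af4a-104536ce2a42", "status_code": "FAILED",
   "job_parameters": {"frame_id": 19224}, "processing_times": [491.125]}
]
"""


def summarize_jobs(jobs):
    """Hypothetical helper: count jobs per (frame_id, status_code) and
    sum the reported processing times for each job."""
    counts = Counter(
        (job["job_parameters"]["frame_id"], job["status_code"]) for job in jobs
    )
    totals = {job["job_id"]: sum(job["processing_times"]) for job in jobs}
    return counts, totals


counts, totals = summarize_jobs(json.loads(JOBS_JSON))
for (frame_id, status), n in sorted(counts.items()):
    print(f"frame {frame_id}: {n} {status}")
```

Pasting the full listing into `JOBS_JSON` should work the same way, since every entry carries these four fields.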

All those jobs failed again, as expected.

The first two jobs failed with "Host EC2 instance terminated." after 21600 seconds. Is 21600 our cut-off? Checking on the rest...

Yep, all failed with this same error, and 21600 s (6 hours) is the timeout for the INSAR_ISCE_TEST troposphere step.

Importantly, these all failed during the RAiDER (troposphere) step:
image

It looks like a RAiDER error -- here is the log for the RAiDER step:
image

This line is throwing the error:
https://github.com/dbekaert/RAiDER/blob/b2d98bee9ad92f470993d17ff54fb9d15476e5f5/tools/RAiDER/aria/prepFromGUNW.py#L248

The failure path is the same for all of these jobs.

Since this is a RAiDER issue, I'm going to close this as done.