nasa/opera-sds-pcm

[Bug]: DISP-S1 SCIFLO PGE fails on any ASF S3 direct-ingestion error with no retry mechanism

Opened this issue · 4 comments

Checked for duplicates

Yes - I've already checked

Describe the bug

DISP-S1 SCIFLO PGE is unique among the SCIFLO PGEs in that it ingests input files directly from ASF S3, without an upstream download job first staging them into OPERA S3. This design saves a great deal of time and cost and eliminates redundant data transfer.

The data staging is done through a lower-level HySDS component (chimera). That component does not appear to account for intermittent network issues: it fails outright if it cannot complete staging. When that call fails, the DISP-S1 SCIFLO PGE also fails, because it has no concept of retrying data staging either. When we run into this, the call stack looks like the following. In the past few weeks of testing this error has occurred a couple dozen times, and each time a human operator has to 1) recognize that the error has occurred and 2) retry the SCIFLO job manually.

Unfortunately, the way the short error is formed makes it difficult, if not impossible, to write a Mozart retry rule for it.

Traceback (most recent call last):
  File "/home/ops/verdi/ops/opera-pcm/opera_chimera/opera_pge_job_submitter.py", line 101, in perform_adaptation_tasks
    download_file(url, path)
  File "/home/ops/verdi/ops/hysds-1.2.12/hysds/utils.py", line 185, in download_file
    shutil.rmtree(path + ".osaka.locked.json")
  File "/opt/conda/lib/python3.9/shutil.py", line 734, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/opt/conda/lib/python3.9/shutil.py", line 642, in _rmtree_safe_fd
    onerror(os.scandir, path, sys.exc_info())
  File "/opt/conda/lib/python3.9/shutil.py", line 638, in _rmtree_safe_fd
    with os.scandir(topfd) as scandir_it:
NotADirectoryError: [Errno 20] Not a directory: '/data/work/jobs/2024/09/25/00/00/job-WF-SCIFLO_L3_DISP_S1-frame-7078-latest_acq_index-900_hist-20240924T215855.810223Z/OPERA_L2_CSLC-S1_T027-056625-IW2_20240217T132356Z_20240604T170147Z_S1A_VV_v1.1.osaka.locked.json'

What did you expect?

We should do one or both of the following:

  1. Find a way for the DISP-S1 SCIFLO PGE to retry data staging instead of failing.
  2. Create a Mozart retry rule for this failure. On its own this is not ideal, though: by the time the job fails it has already run for roughly 1000 seconds, so that computation time is wasted.

Reproducible steps

1. Run any job that would trigger DISP-S1 SCIFLO PGE
2. Occasionally it fails with the NotADirectoryError shown above. I have not been able to spot any pattern in when it fails.

Environment

- Version of this software [e.g. vX.Y.Z]
- Operating System: [e.g. MacOSX with Docker Desktop vX.Y]
...

The underlying error that causes this is error 104 (connection reset by peer):

Traceback (most recent call last):
  File "/home/ops/verdi/ops/hysds-1.2.12/hysds/utils.py", line 178, in download_file
    return osaka.main.get(url, path, params=params)
  File "/home/ops/verdi/ops/osaka-1.2.3/osaka/main.py", line 76, in get
    transfer(
  File "/home/ops/verdi/ops/osaka-1.2.3/osaka/main.py", line 118, in transfer
    transfer.transfer(
  File "/home/ops/verdi/ops/osaka-1.2.3/osaka/transfer.py", line 128, in transfer
    raise err
  File "/home/ops/verdi/ops/osaka-1.2.3/osaka/transfer.py", line 106, in transfer
    metrics = self.transfer_uri(source, shandle, dest, dhandle)
  File "/home/ops/verdi/ops/osaka-1.2.3/osaka/transfer.py", line 161, in transfer_uri
    counts = osaka.utils.product_composite_iterator(source, shandle, transfer_one)
  File "/home/ops/verdi/ops/osaka-1.2.3/osaka/utils.py", line 123, in product_composite_iterator
    return list(map(callback, uris))
  File "/home/ops/verdi/ops/osaka-1.2.3/osaka/transfer.py", line 156, in transfer_one
    stream = shandle.get(uri)
  File "/home/ops/verdi/ops/osaka-1.2.3/osaka/storage/s3.py", line 218, in get
    obj.download_fileobj(fh)
  File "/home/ops/verdi/lib/python3.9/site-packages/boto3/s3/inject.py", line 884, in object_download_fileobj
    return self.meta.client.download_fileobj(
  File "/home/ops/verdi/lib/python3.9/site-packages/boto3/s3/inject.py", line 795, in download_fileobj
    return future.result()
  File "/home/ops/verdi/lib/python3.9/site-packages/s3transfer/futures.py", line 103, in result
    return self._coordinator.result()
  File "/home/ops/verdi/lib/python3.9/site-packages/s3transfer/futures.py", line 266, in result
    raise self._exception
  File "/home/ops/verdi/lib/python3.9/site-packages/s3transfer/tasks.py", line 139, in __call__
    return self._execute_main(kwargs)
  File "/home/ops/verdi/lib/python3.9/site-packages/s3transfer/tasks.py", line 162, in _execute_main
    return_value = self._main(**kwargs)
  File "/home/ops/verdi/lib/python3.9/site-packages/s3transfer/download.py", line 583, in _main
    for chunk in chunks:
  File "/home/ops/verdi/lib/python3.9/site-packages/s3transfer/download.py", line 728, in __next__
    chunk = self._body.read(self._chunksize)
  File "/home/ops/verdi/lib/python3.9/site-packages/s3transfer/utils.py", line 599, in read
    value = self._stream.read(*args, **kwargs)
  File "/home/ops/verdi/lib/python3.9/site-packages/botocore/response.py", line 104, in read
    raise ResponseStreamingError(error=e)
botocore.exceptions.ResponseStreamingError: An error occurred while reading from response stream: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

A couple of observations related to this issue:

  • Notice that the top frame of the NotADirectoryError stack trace is a file within our repo (and therefore within our control):
    File "/home/ops/verdi/ops/opera-pcm/opera_chimera/opera_pge_job_submitter.py", line 101, in perform_adaptation_tasks
        download_file(url, path)
    We should be able to wrap opera_pge_job_submitter.perform_adaptation_tasks with a backoff/retry decorator so we can perform automatic download retries when there are connection outages with ASF/S3 (a sketch of such a retry wrapper appears after this list).
  • Regarding the misleading NotADirectoryError we've been seeing, I think it can be traced back to these lines of code in the baseline hysds package: https://github.com/hysds/hysds/blob/develop/hysds/utils.py#L179-L182
    On line 179, the directory tree rooted at path is deleted by a call to shutil.rmtree, yet the stack trace shows the error being raised from the shutil.rmtree call at line 185. This looks like a race condition: the conditional at line 180 is still satisfied after the rmtree call on line 179, but by the time execution reaches the second rmtree call on line 185, the path + ".osaka.locked.json" file has already been deleted, hence the NotADirectoryError being raised.
    This is all to say that I think we should open a bug report on the hysds repository so it can be fixed there, and we can focus on catching the actual ResponseStreamingError that has been plaguing DISP-S1 (a sketch of a more defensive cleanup also appears after this list).
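A minimal sketch of what such a retry wrapper could look like, assuming it is applied around the download_file call made from perform_adaptation_tasks. The decorator name and retry parameters are illustrative, not existing opera-pcm code:

import random
import time
from functools import wraps

from botocore.exceptions import ResponseStreamingError


def retry_on_transient_s3_errors(max_attempts=10, base_delay=2.0, max_delay=300.0):
    """Retry the wrapped callable with exponential, randomized backoff on
    transient S3/network errors. Names and defaults are illustrative."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except (ResponseStreamingError, ConnectionResetError):
                    if attempt == max_attempts:
                        raise
                    # Exponential backoff with full jitter to avoid retry storms.
                    delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                    time.sleep(random.uniform(0, delay))
        return wrapper
    return decorator


# Hypothetical usage inside opera_pge_job_submitter.py:
#
# @retry_on_transient_s3_errors(max_attempts=10)
# def _staged_download(url, path):
#     download_file(url, path)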
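And a sketch of the kind of defensive cleanup that could accompany the hysds bug report; this is not the actual hysds/utils.py code, just an illustration of tolerating a lock marker that has already disappeared or that is a plain file rather than a directory:

import contextlib
import os
import shutil


def cleanup_failed_download(path):
    """Remove a partially staged download and its osaka lock marker,
    tolerating the case where either has already been removed (the suspected
    race) or where the lock marker is a file rather than a directory."""
    lock_marker = path + ".osaka.locked.json"
    for target in (path, lock_marker):
        with contextlib.suppress(FileNotFoundError):
            if os.path.isdir(target):
                shutil.rmtree(target)
            elif os.path.exists(target):
                os.remove(target)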

For this case, where the S3 access happens inside the PGE, I suggest two things that can help:

  1. In the PGE, wrap the S3 download call in a try/except and retry with exponential, randomized backoff. Cap the retries at a finite maximum, like 10 or so, to prevent runaways.

  2. Once the finite number of backed-off retries has been exhausted, error out with a higher-level error and do not let the code continue, since continuing appears to be what led to the NotADirectoryError, which is less intuitive for operators. Instead, catch the error and write out a higher-level, logical error such as "unable to download ... after n retries". The errors can be saved to the work dir in these files (a sketch of this pattern is shown below):
     • "_alt_error.txt" for the short error message
     • "_alt_traceback.txt" for the full stack trace

The verdi job worker that manages the PGE looks for the "_alt_error.txt" file and, if it exists, propagates its contents as the short error to Mozart Figaro/ES, where it is visible to operators and available for automated behavior via trigger rules. If "_alt_traceback.txt" exists, the worker propagates it as the full stack trace.
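A sketch of what the final failure handling could look like once retries are exhausted, writing the operator-facing files described above. The helper name and message text are illustrative, not existing PGE code:

import os
import traceback


def record_alt_error(work_dir, short_message):
    """Write the files the verdi worker propagates to Mozart/Figaro: the
    short error message and the full traceback of the current exception."""
    with open(os.path.join(work_dir, "_alt_error.txt"), "w") as f:
        f.write(short_message)
    with open(os.path.join(work_dir, "_alt_traceback.txt"), "w") as f:
        f.write(traceback.format_exc())


# Hypothetical use at the download call site:
#
# try:
#     download_with_retries(url, path, max_attempts=10)
# except Exception as err:
#     record_alt_error(os.getcwd(), f"Unable to download {url} after 10 retries: {err}")
#     raise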

See the following:

Notes from Matt Perry at ASF:

You could try increasing your multipart chunk size in your boto3 TransferConfig setup

You could be hitting the S3 API too much; I think there are rate limits there.
Default rate limits: Amazon S3 has default rate limits for each prefix in a bucket:
  • PUT/COPY/POST/DELETE: at least 3,500 requests per second
  • GET/HEAD: at least 5,500 requests per second
Although I'd expect you would see 429 errors if that were the case.

We have found that a 128MB multipart chunk size works well for us.
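For reference, a minimal sketch of passing a larger multipart chunk size through boto3's TransferConfig, assuming the S3 download in question can be given such a config; the 128 MB value follows the suggestion above, and the bucket/key names are placeholders:

import boto3
from boto3.s3.transfer import TransferConfig

# 128 MB multipart chunk size, per the suggestion above.
transfer_config = TransferConfig(multipart_chunksize=128 * 1024 * 1024)

s3 = boto3.resource("s3")
obj = s3.Object("example-bucket", "example/OPERA_L2_CSLC-S1.h5")  # placeholder bucket/key

with open("OPERA_L2_CSLC-S1.h5", "wb") as fh:
    # download_fileobj accepts a Config argument that controls chunking.
    obj.download_fileobj(fh, Config=transfer_config)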