[Bug]: DISP-S1 SCIFLO PGE fails upon encountering any ASF S3 direct ingestion issues without any retry concept
Opened this issue · 4 comments
Checked for duplicates
Yes - I've already checked
Describe the bug
DISP-S1 SCIFLO PGE is unique among the SCIFLO PGEs in that it ingests input files directly from ASF S3, without an upstream download job having to stage them into OPERA S3 first. This design saves a great deal of time and cost and eliminates redundant data transfer.
The data staging is done through a lower-level HySDS component ("chimera" or similar). This component doesn't seem to have any concept of intermittent network issues - it simply fails if it can't complete staging. When this component call fails, DISP-S1 SCIFLO PGE also fails, because it too has no concept of retrying data staging. When we run into this, the call stack looks like the following. Over the past few weeks of testing this error has occurred a couple dozen times. When it happens, a human operator has to 1) recognize that the error has occurred and 2) retry the SCIFLO manually.
Unfortunately, the way the short error is formed makes it difficult or impossible to create a Mozart retry rule for it.
```
Traceback (most recent call last):
  File "/home/ops/verdi/ops/opera-pcm/opera_chimera/opera_pge_job_submitter.py", line 101, in perform_adaptation_tasks
    download_file(url, path)
  File "/home/ops/verdi/ops/hysds-1.2.12/hysds/utils.py", line 185, in download_file
    shutil.rmtree(path + ".osaka.locked.json")
  File "/opt/conda/lib/python3.9/shutil.py", line 734, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/opt/conda/lib/python3.9/shutil.py", line 642, in _rmtree_safe_fd
    onerror(os.scandir, path, sys.exc_info())
  File "/opt/conda/lib/python3.9/shutil.py", line 638, in _rmtree_safe_fd
    with os.scandir(topfd) as scandir_it:
NotADirectoryError: [Errno 20] Not a directory: '/data/work/jobs/2024/09/25/00/00/job-WF-SCIFLO_L3_DISP_S1-frame-7078-latest_acq_index-900_hist-20240924T215855.810223Z/OPERA_L2_CSLC-S1_T027-056625-IW2_20240217T132356Z_20240604T170147Z_S1A_VV_v1.1.osaka.locked.json'
```
What did you expect?
We should do one or both of the following:
- Figure out a way for DISP-S1 SCIFLO PGE to retry data staging without failing
- Concoct a retry rule for Mozart for when this happens. But this solution alone is not ideal: by the time the job fails it will already have executed for ~1000 seconds, so we'd be wasting that compute time.
Reproducible steps
1. Run any job that would trigger DISP-S1 SCIFLO PGE
2. Sometimes it will fail with the `NotADirectoryError` shown above. I have not been able to spot any pattern as to when it fails.
Environment
- Version of this software [e.g. vX.Y.Z]
- Operating System: [e.g. MacOSX with Docker Desktop vX.Y]
...
The underlying error that causes this is a connection reset (errno 104):
```
Traceback (most recent call last):
  File "/home/ops/verdi/ops/hysds-1.2.12/hysds/utils.py", line 178, in download_file
    return osaka.main.get(url, path, params=params)
  File "/home/ops/verdi/ops/osaka-1.2.3/osaka/main.py", line 76, in get
    transfer(
  File "/home/ops/verdi/ops/osaka-1.2.3/osaka/main.py", line 118, in transfer
    transfer.transfer(
  File "/home/ops/verdi/ops/osaka-1.2.3/osaka/transfer.py", line 128, in transfer
    raise err
  File "/home/ops/verdi/ops/osaka-1.2.3/osaka/transfer.py", line 106, in transfer
    metrics = self.transfer_uri(source, shandle, dest, dhandle)
  File "/home/ops/verdi/ops/osaka-1.2.3/osaka/transfer.py", line 161, in transfer_uri
    counts = osaka.utils.product_composite_iterator(source, shandle, transfer_one)
  File "/home/ops/verdi/ops/osaka-1.2.3/osaka/utils.py", line 123, in product_composite_iterator
    return list(map(callback, uris))
  File "/home/ops/verdi/ops/osaka-1.2.3/osaka/transfer.py", line 156, in transfer_one
    stream = shandle.get(uri)
  File "/home/ops/verdi/ops/osaka-1.2.3/osaka/storage/s3.py", line 218, in get
    obj.download_fileobj(fh)
  File "/home/ops/verdi/lib/python3.9/site-packages/boto3/s3/inject.py", line 884, in object_download_fileobj
    return self.meta.client.download_fileobj(
  File "/home/ops/verdi/lib/python3.9/site-packages/boto3/s3/inject.py", line 795, in download_fileobj
    return future.result()
  File "/home/ops/verdi/lib/python3.9/site-packages/s3transfer/futures.py", line 103, in result
    return self._coordinator.result()
  File "/home/ops/verdi/lib/python3.9/site-packages/s3transfer/futures.py", line 266, in result
    raise self._exception
  File "/home/ops/verdi/lib/python3.9/site-packages/s3transfer/tasks.py", line 139, in __call__
    return self._execute_main(kwargs)
  File "/home/ops/verdi/lib/python3.9/site-packages/s3transfer/tasks.py", line 162, in _execute_main
    return_value = self._main(**kwargs)
  File "/home/ops/verdi/lib/python3.9/site-packages/s3transfer/download.py", line 583, in _main
    for chunk in chunks:
  File "/home/ops/verdi/lib/python3.9/site-packages/s3transfer/download.py", line 728, in __next__
    chunk = self._body.read(self._chunksize)
  File "/home/ops/verdi/lib/python3.9/site-packages/s3transfer/utils.py", line 599, in read
    value = self._stream.read(*args, **kwargs)
  File "/home/ops/verdi/lib/python3.9/site-packages/botocore/response.py", line 104, in read
    raise ResponseStreamingError(error=e)
botocore.exceptions.ResponseStreamingError: An error occurred while reading from response stream: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
```
A couple of observations related to this issue:
- Notice that at the top of the stack trace for the `NotADirectoryError` is a file within our repo (and therefore within our control): `opera_chimera/opera_pge_job_submitter.py`, line 101, in `perform_adaptation_tasks`, at the `download_file(url, path)` call. We should be able to wrap `opera_pge_job_submitter.perform_adaptation_tasks` with a backoff/retry decorator so we can perform automatic download retries when there are connection outages with ASF/S3.
- Regarding the misleading `NotADirectoryError` we've been seeing, I think it can be traced back to these lines of code in the baseline `hysds` package: https://github.com/hysds/hysds/blob/develop/hysds/utils.py#L179-L182
Notice that on line 179 the directory tree rooted at `path` is deleted by a call to `shutil.rmtree`, but from the stack trace the error is getting raised from the `shutil.rmtree` call at line 185. So this seems to be a race condition: the conditional at line 180 is still satisfied after the `rmtree` call on line 179, but by the time the code reaches the second `rmtree` call on line 185, the `path + ".osaka.locked.json"` file is already deleted, hence the `NotADirectoryError` getting raised.
This is all to say that I think we should open a bug report for this on the `hysds` repository so it can be fixed there, and we can focus on catching the actual `ResponseStreamingError` that's been plaguing DISP-S1.
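A minimal sketch of such a backoff/retry decorator, using only the standard library. The decorator name, parameters, and defaults here are assumptions for illustration, not existing opera-pcm code; in practice the `retryable` tuple would be narrowed to the connection-related exceptions (e.g. `botocore.exceptions.ResponseStreamingError`) rather than retrying everything:

```python
import functools
import random
import time


def retry_with_backoff(max_retries=10, base_delay=1.0, max_delay=60.0,
                       retryable=(Exception,)):
    """Retry the wrapped callable with exponential, randomized backoff.

    Hypothetical helper: retries up to max_retries times, sleeping a random
    ("full jitter") fraction of an exponentially growing, capped delay
    between attempts, then re-raises the last error once retries run out.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except retryable:
                    if attempt == max_retries:
                        raise  # out of retries: surface the real error
                    # exponential backoff with full jitter
                    delay = min(max_delay, base_delay * (2 ** attempt))
                    time.sleep(random.uniform(0, delay))
        return wrapper
    return decorator
```

With something like this in place, `perform_adaptation_tasks` (or just its `download_file` call) could be decorated so transient ASF/S3 connection resets are absorbed instead of failing the whole SCIFLO job.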
For this case, with the PGE doing S3 access inside the PGE, I suggest two things that can help:
- In the PGE, wrap the S3 download call with a try/except, and use exponential randomized backoff for the retries. Cap the retries at a finite maximum, like 10 or so, to prevent runaways.
- In the except clause, once the finite number of backed-off retries has been exhausted, error out with a higher-level error, and do not let the code continue, since that appears to be what led to the `NotADirectoryError`, which is less intuitive for operators. Instead, catch the error and write out your own higher-level logical error like "unable to download ... after n retries". The errors can be saved to the work dir in these files:
  - `_alt_error.txt` for the short error message
  - `_alt_traceback.txt` for the full stack trace
The verdi job worker that manages the PGE looks for the `_alt_error.txt` file and, if it exists, propagates its contents as the short error to Mozart figaro/ES for operators and for automated behavior via trigger rules. If `_alt_traceback.txt` exists, the worker propagates it as the full stack trace.
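As a sketch of that second suggestion, a hypothetical helper (the function name and signature are assumptions; only the `_alt_error.txt`/`_alt_traceback.txt` file names come from the comment above) that dumps the operator-facing error files into the job work dir. It is meant to be called from inside the final `except` clause, after retries are exhausted:

```python
import traceback
from pathlib import Path


def write_alt_error(work_dir, short_error):
    """Dump operator-friendly error files for the verdi worker to pick up.

    Writes _alt_error.txt (the short, human-readable error) and
    _alt_traceback.txt (the full stack trace of the exception currently
    being handled) into the job work directory.
    """
    work_dir = Path(work_dir)
    (work_dir / "_alt_error.txt").write_text(short_error)
    # format_exc() captures the traceback of the in-flight exception,
    # so this must be called from within an except block.
    (work_dir / "_alt_traceback.txt").write_text(traceback.format_exc())
```

Hypothetical call site, once the backed-off retries have run out: catch the `ResponseStreamingError`, call `write_alt_error(job_work_dir, "unable to download <url> after 10 retries")`, and re-raise, so the job still fails but with a clear, rule-matchable short error instead of the confusing `NotADirectoryError`.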
See the following:
- overwrite error if `_alt_error.txt` was dumped
- overwrite traceback if `_alt_traceback.txt` was dumped
Notes from Matt Perry at ASF:
You could try increasing your multipart chunk size in your boto3 `TransferConfig` setup. You could also be hitting the S3 API too much; I think there are rate limits there. Amazon S3 has default rate limits for each prefix in a bucket:
- PUT/COPY/POST/DELETE: at least 3,500 requests per second
- GET/HEAD: at least 5,500 requests per second
Although I'd expect you would see 429 errors if that were the case. We have found that a 128 MB multipart chunk size works well for us.
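For reference, a sketch of how the suggested 128 MB chunk size might be set with boto3's `TransferConfig` (the `multipart_chunksize` and `multipart_threshold` parameters are real boto3 API, with 8 MB defaults; the variable name and the wiring into osaka's download call are assumptions):

```python
from boto3.s3.transfer import TransferConfig

# 128 MB multipart chunk size, per ASF's suggestion (boto3 defaults to 8 MB).
# Larger chunks mean fewer GET requests per object, which reduces pressure
# on S3's per-prefix request rate limits.
transfer_config = TransferConfig(
    multipart_chunksize=128 * 1024 * 1024,
    multipart_threshold=128 * 1024 * 1024,
)

# Hypothetical usage at osaka's download site (osaka/storage/s3.py, get()):
# obj.download_fileobj(fh, Config=transfer_config)
```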