nasa/opera-sds-pcm

Reduce the number of R2 product duplicates


Checked for duplicates

Yes - I've already checked

Alternatives considered

No - I haven't considered

Related problems

No response

Describe the feature request

OPERA SDS wants to reduce the number of duplicate R2 product generations.
Suggested proposal:

  - if an SLC granule has revision=1, always process it
  - if an SLC granule has revision>1, check the Elasticsearch database:
    - if the SLC granule is found, meaning it was already processed, do NOT process it again
    - if the SLC granule is NOT found, meaning it was never processed, DO process it

After team discussions we decided to use an approach that takes advantage of the hySDS dedup feature. hySDS dedups a job when all of its parameters, taken as a group, duplicate those of an existing job. It does so by hashing all parameters together and comparing the hashes. The key here is that we are able to override that default hashing behavior and provide our own custom hash for a particular job.

The design is to compute a hash of the granule_id of the SLC granule and use that as the sole component of the hash for the download jobs. This achieves the same result as the original request because of how SLC granules are named in ASF/CMR: the granule ID stays the same across CMR revisions, so a re-delivered granule produces the same hash as the already-processed one.
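As a rough, hypothetical sketch of this design (the actual PCM code and the hySDS hash-override API may differ), the dedup hash for an SLC download job would be derived from the granule ID alone rather than from the full parameter set:

  import hashlib

  def download_job_dedup_hash(granule_id: str) -> str:
      # Hypothetical sketch: derive the dedup hash from the SLC granule ID
      # alone. Any two download jobs for the same granule then produce the
      # same hash, so hySDS treats the later submission as a duplicate even
      # if other job parameters (e.g., the CMR revision) differ.
      return hashlib.md5(granule_id.encode("utf-8")).hexdigest()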

hySDS performs dedup testing in two stages:

  1. When the job is submitted, it is tested for dedup against all existing Queued, Started, and Completed jobs.
  2. When the job is about to start, it is deduped again against all Started and Completed jobs. This ensures the right job can still run if a previous one failed and was never retried or never succeeded.

The vast majority of our use cases (99%+) will fall under scenario 1 above. Testing the second scenario is very time-consuming because we would have to hack the system to fail/offline/revoke jobs and then restart while precisely timing the execution of the duplicate jobs. Given that scenario 1 is the most likely case and we are not worried about a small amount of leakage in the system, we will test only case 1 above.

To make sure this works as expected, we will perform the following tests on an R2.1.1 system.

Test 1: Test the basics and that we are deduping against Queued, Started, and Completed states

  1. Hack the query code to use the same constant hash (see the sketch after this list).
  2. Submit one download job.
  3. While that job is in the Queued state, submit 10+ download jobs. They should all be deduped.
  4. Wait until the job moves to the Started state and submit several more, different download jobs. Same output expectations.
  5. Wait until the job moves to the Completed state and submit several more, different download jobs. Same output expectations.
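For reference, the "hack" in step 1 could look something like the following test-only modification of the hypothetical helper sketched earlier: every submission returns one constant, so every download job collides with the first one regardless of granule.

  def download_job_dedup_hash(granule_id: str) -> str:
      # Test-only hack: ignore the granule ID and always return the same
      # constant, so every submitted download job dedups against the first.
      return "0" * 32  # fixed 32-char string standing in for a real md5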

Test 2: Test dedup against multiple download jobs

  1. Un-hack the query code from the previous test.
  2. Set the download ASG to 3.
  3. Submit 10+ download jobs. Wait until there are jobs in all three states: Queued, Started, and Completed.
  4. Submit the same 10+ download jobs. They should all be deduped.
  5. Run a different set and make sure they are not deduped.

Test 3: Non-SLC products dedup behavior has not changed

  1. We will only test HLS granules because this fix is intended for R2 PCM; RTC and CSLC did not exist back then. When we port this change over to the R3 software, we will perform RTC and CSLC tests.
  2. We need to hack the software again because only one version of a granule exists at the CMR at any given time. We want to see that HLS dedup still works as before.
  3. To do this, submit one HLS granule.
  4. Hack the code so that the revision number "coming from CMR" is different (see the sketch after this list).
  5. Submit that same HLS granule again. It should not be deduped.
  6. Submit 10+ HLS granules in normal forward processing and observe that they process as expected.
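To illustrate why HLS behavior is unchanged: non-SLC download jobs keep the default behavior of hashing all job parameters together, and that parameter set includes the CMR revision, so a bumped revision yields a new hash and is not deduped. A hypothetical sketch (the parameter names are illustrative, not the actual hySDS internals):

  import hashlib

  def default_job_dedup_hash(params: dict) -> str:
      # Hypothetical sketch of the default behavior: hash all job
      # parameters together, so a changed revision_id yields a new hash.
      canonical = "".join(f"{k}={params[k]}" for k in sorted(params))
      return hashlib.md5(canonical.encode("utf-8")).hexdigest()

  # Same granule, bumped revision -> different hashes, so no dedup:
  h1 = default_job_dedup_hash({"granule_id": "HLS.S30.T16QED.2023196T160829.v2.0", "revision_id": 1})
  h2 = default_job_dedup_hash({"granule_id": "HLS.S30.T16QED.2023196T160829.v2.0", "revision_id": 2})
  assert h1 != h2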

Test 1
python daac_data_subscriber.py query -c SENTINEL-1A_SLC --processing-mode=reprocessing --job-queue=opera-job_worker-slc_data_download --chunk-size=1 --release-version=756_dedup_r2_downloads --native-id=S1A_IW_SLC__1SDV_20230724T191223_20230724T191253_049569_05F5E3_13DB-SLC

Test 1 and Test 2
python daac_data_subscriber.py query -c SENTINEL-1A_SLC --job-queue=opera-job_worker-slc_data_download --release-version=756_dedup_r2_downloads --chunk-size=1 --start-date=2021-10-24T23:00:00Z --end-date=2021-10-24T23:40:00Z --use-temporal

python daac_data_subscriber.py query -c SENTINEL-1A_SLC --job-queue=opera-job_worker-slc_data_download --release-version=756_dedup_r2_downloads --chunk-size=1 --start-date=2021-10-24T23:40:00Z --end-date=2021-10-24T23:59:00Z --use-temporal

Test 3
python daac_data_subscriber.py query -c HLSS30 --job-queue=opera-job_worker-hls_data_download --chunk-size=1 --native-id=HLS.S30.T16QED.2023196T160829.v2.0 --release-version=756_dedup_r2_downloads
--> The revision ID is 1. Hack the code to hard-code it to 2 when retrying.

python daac_data_subscriber.py query -c HLSL30 --job-queue=opera-job_worker-hls_data_download --chunk-size=1 --start-date=2023-10-25T00:00:00Z --end-date=2023-10-25T00:10:00Z --use-temporal --release-version=756_dedup_r2_downloads

@hhlee445 This ticket has been implemented and tested as described. It's ready to be pulled into whichever branch the newest R2 release will be cut from.

To test with the R3 release software, we also need to test RTC and CSLC query/download and run tests equivalent to Test 3 above.

RTC
python daac_data_subscriber.py query --collection-shortname=OPERA_L2_RTC-S1_V1 --endpoint=OPS --release-version=develop --job-queue=opera-job_worker-rtc_data_download --chunk-size=1 --transfer-protocol=auto --native-id=OPERA_L2_RTC-S1_T053-112813-IW1_20240102T082658Z_20240104T195942Z_S1A_30_v1.0

AND then 1) clear the RTC catalog and 2) repeat the query call above so that it generates the same download job

CSLC
python daac_data_subscriber.py query -c OPERA_L2_CSLC-S1_V1 --start-date=2023-12-15T08:17:50Z --chunk-size=2 --k=2 --job-queue=opera-job_worker-cslc_data_download --processing-mode=forward --end-date=2023-12-15T08:35:59Z

python daac_data_subscriber.py query -c OPERA_L2_CSLC-S1_V1 --start-date=2023-12-15T08:36:00Z --chunk-size=2 --k=2 --job-queue=opera-job_worker-cslc_data_download --processing-mode=forward --end-date=2023-12-15T08:40:00Z

AND then 1) clear the CSLC catalog and 2) repeat the two calls so that they generate the same download jobs

@hhlee445 This change was ported into the R3 codebase in the branch 756_dedup_r2_downloads_for_r3_release