Detect and replace obsolete AWS S3 cmr.xml
(ECHO-10 XML) metadata objects with valid cmr.json
(UMM-G JSON) metdata objects. Does not alter/write to any AWS asset (DynamoDB, RDS, OpenSearch, etc.) other than modifying S3 objects.
DAAC operators or developers using the Cumulus Framework who want to replace obsolete AWS S3 cmr.xml
metadata objects with valid cmr.json
ones.
In particular, this software assumes a 1:1 correspondence between Collections and S3 location (that is, the granules for collection x are all in the same S3 location, and granules belonging to other collections are not present in this location), and no garauntee is made that the results will be as-expected for use cases departing from this paradigm.
Minimum python version: 3.8.13+ The following external python3 libraries are required:
- boto3
- botocore
- requests
- urllib3
The necessary libraries can be installed directly into the currently-active python3 environment by issuing the following command. For most users, this will be sufficient:
$ pip install -r requirements.txt
In some scenarios (e.g., dependency resolver incompatibilities), the above approach may not satisfy your needs. Although beyond the scope of this repository/guide, solutions such as Conda can be used to create a brand new python3 environment with the necessary libraries.
Assuming you have Conda installed, you can run the following commands to create and activate a new python3 environment called s3_metadata
with the required libraries:
$ conda env create -f environment.yml
$ conda activate s3_metadata
Additionally, the executor must have a properly configured AWS Profile. The following characteristics of the collection whose files are being altered, and the target AWS environment, must be known:
- Collection shortname
- Collection version
- AWS S3 bucket name
- AWS Profile name
- S3 prefix, aka the internal S3 path within the bucket to the
cmr.xml
files
The script searches a user-specified S3 location and searches all files and builds a dictionary with the following format:
{'filename_without_json_or_xml: {'json_exists': <True/False>, 'xml_exists': <True/False>}}. The extensions are expected to be cmr.json or cmr.xml. These are always presumed to contain granule metadata. For each such object found, a call is made to the CMR API to retrieve the equivalent UMM-G JSON granule metadata, which is streamed (written to) a new cmr.json
object in the user-specified S3 location. Extant cmr.xml
objects are removed from S3 only when all cmr.json
objects have been written.
An error with one (or several) per-granule attempts should not cause other attempts to fail (unless the error is unrecoverable/etc). More details about errors and edge cases will be established as this tool sees real-world use.
STDOUT/STDERR will show progress/errors as usual. In addition, a .csv
file will be created locally. This file contains a list of the created cmr.json
object(s) with their corresponding file size(s) and can be used as a record of "what has been done".
cmr.xml
objects are removed and replacement cmr.json
objects are created in the user-specified S3 location.
The usual exit code is 0
, signifying no errors.
- If CMR does not have UMM-G JSON metadata for the granule corresponding to a particular
cmr.xml
object, an error like the following will be shown, and that granule'scmr.xml
object will not be removed, nor a replacementcmr.json
object created:
CMR returned no hits for short_name: {short_name}, granule_ur: {base_name_key}'
2. If both a cmr.xml
and cmr.json
object with the same basename already exist in the user-specified S3 location, the cmr.xml
object will be deleted and no new/replacement cmr.json
object will be created.
Issue the following command to view the help/usage output (as shown below):
$ python main.py --help
usage: main.py [-h] --short-name SHORT_NAME --version VERSION --bucket BUCKET --aws_profile AWS_PROFILE [--prefix PREFIX]
[--environment {sbx,sit,uat,prod}]
Searches a provided location for cmr.xml files, creates cmr.jsonfiles as needed, and deletes the cmr.xml files. Generates a csvwith the
creates json files and file sizes.
optional arguments:
-h, --help show this help message and exit
required arguments:
--short-name SHORT_NAME, -s SHORT_NAME
Collection short name.
--version VERSION, -v VERSION
Collection version.
--bucket BUCKET, -b BUCKET
Bucket to check.
--aws_profile AWS_PROFILE, -pr AWS_PROFILE
AWS PROFILE
--prefix PREFIX, -p PREFIX
Prefix for collection location.
--environment {sbx,sit,uat,prod}, -e {sbx,sit,uat,prod}
This S3 location has a mixture of cmr.xml
and cmr.json
objects representing granule metadata. The date ranges are non-overlapping, which is what we normally expect to see:
$ aws s3 ls ghrcwuat-public/rssmif17d__7/ --profile ghrcwuat | grep xml | wc -l
275
$ aws s3 ls ghrcwuat-public/rssmif17d__7/ --profile ghrcwuat | grep json | wc -l
117
$ aws s3 ls ghrcwuat-public/rssmif17d__7/ --profile ghrcwuat | grep xml | head
2020-08-31 14:31:50 2217 f17_ssmis_20200801v7.nc.cmr.xml
2020-08-31 14:32:43 2217 f17_ssmis_20200802v7.nc.cmr.xml
2020-08-31 14:32:10 2217 f17_ssmis_20200803v7.nc.cmr.xml
2020-08-31 14:32:14 2216 f17_ssmis_20200804v7.nc.cmr.xml
2020-08-31 14:32:00 2217 f17_ssmis_20200805v7.nc.cmr.xml
2020-08-31 14:32:11 2217 f17_ssmis_20200806v7.nc.cmr.xml
2020-08-31 14:32:35 2217 f17_ssmis_20200807v7.nc.cmr.xml
2020-08-31 14:31:55 2217 f17_ssmis_20200808v7.nc.cmr.xml
2020-08-31 14:32:23 2217 f17_ssmis_20200809v7.nc.cmr.xml
2020-08-31 14:31:58 2217 f17_ssmis_20200810v7.nc.cmr.xml
$ aws s3 ls ghrcwuat-public/rssmif17d__7/ --profile ghrcwuat | grep xml | tail
2021-05-17 15:27:42 2139 f17_ssmis_20210508v7.nc.cmr.xml
2021-05-17 15:27:58 2139 f17_ssmis_20210509v7.nc.cmr.xml
2021-05-17 15:27:47 2139 f17_ssmis_20210510v7.nc.cmr.xml
2021-05-17 15:27:42 2139 f17_ssmis_20210511v7.nc.cmr.xml
2021-05-17 15:27:51 2139 f17_ssmis_20210512v7.nc.cmr.xml
2021-05-17 15:28:03 2139 f17_ssmis_20210513v7.nc.cmr.xml
2021-05-17 19:32:41 2139 f17_ssmis_20210514v7.nc.cmr.xml
2021-05-17 19:32:43 2139 f17_ssmis_20210515v7.nc.cmr.xml
2021-05-17 19:32:46 2139 f17_ssmis_20210516v7.nc.cmr.xml
2021-05-17 19:32:46 2139 f17_ssmis_20210517v7.nc.cmr.xml
$ aws s3 ls ghrcwuat-public/rssmif17d__7/ --profile ghrcwuat | grep json | head
2021-07-28 15:21:06 2278 f17_ssmis_20210701v7.nc.cmr.json
2021-07-28 15:21:04 2277 f17_ssmis_20210702v7.nc.cmr.json
2021-07-28 15:20:45 2278 f17_ssmis_20210703v7.nc.cmr.json
2021-07-28 15:21:02 2278 f17_ssmis_20210704v7.nc.cmr.json
2021-07-28 15:21:01 2278 f17_ssmis_20210705v7.nc.cmr.json
2021-07-28 15:20:41 2278 f17_ssmis_20210706v7.nc.cmr.json
2021-07-28 15:20:58 2278 f17_ssmis_20210707v7.nc.cmr.json
2021-07-28 15:20:56 2278 f17_ssmis_20210708v7.nc.cmr.json
2021-07-28 15:20:37 2277 f17_ssmis_20210709v7.nc.cmr.json
2021-07-28 15:20:44 2278 f17_ssmis_20210710v7.nc.cmr.json
$ aws s3 ls ghrcwuat-public/rssmif17d__7/ --profile ghrcwuat | grep json | tail
2021-11-30 15:55:35 2717 f17_ssmis_20211129v7.nc.cmr.json
2021-11-30 15:55:23 2717 f17_ssmis_20211130v7.nc.cmr.json
2021-12-08 16:18:47 2716 f17_ssmis_20211201v7.nc.cmr.json
2021-12-08 16:18:33 2717 f17_ssmis_20211202v7.nc.cmr.json
2021-12-08 16:18:57 2716 f17_ssmis_20211203v7.nc.cmr.json
2021-12-08 16:18:39 2717 f17_ssmis_20211204v7.nc.cmr.json
2021-12-08 16:18:00 2716 f17_ssmis_20211205v7.nc.cmr.json
2021-12-08 16:18:43 2717 f17_ssmis_20211206v7.nc.cmr.json
2021-12-08 16:18:55 2717 f17_ssmis_20211207v7.nc.cmr.json
2021-12-08 16:18:20 2717 f17_ssmis_20211208v7.nc.cmr.json
We want to replace the 275 cmr.xml
objects with 275 cmr.json
objects, so in our case (the rssmif17d
collection, version 7
, uat
environment, ghrcwuat-public
bucket) we run the following command:
$ python main.py --short-name rssmif17d --version 7 --bucket ghrcwuat-public --aws_profile ghrcwuat --prefix rssmif17d__7 --environment uat
The full invocation + results of this example are shown below:
$ python main.py --short-name rssmif17d --version 7 --bucket ghrcwuat-public --aws_profile ghrcwuat --prefix rssmif17d__7 --environment uat
Processing: rssmif17d__7/
Checking: rssmif17d__7/f17_ssmis_20200801v7.nc.cmr.xml
Checking: rssmif17d__7/f17_ssmis_20200802v7.nc.cmr.xml
Checking: rssmif17d__7/f17_ssmis_20200803v7.nc.cmr.xml
....etc, etc, 1 "Checking" line per each cmr.xml object found....
Checking: rssmif17d__7/f17_ssmis_20210517v7.nc.cmr.xml
Checking: rssmif17d__7/f17_ssmis_20210701v7.nc.cmr.json
Checking: rssmif17d__7/f17_ssmis_20210702v7.nc.cmr.json
Checking: rssmif17d__7/f17_ssmis_20210703v7.nc.cmr.json
Checking: rssmif17d__7/f17_ssmis_20210704v7.nc.cmr.json
....etc, etc, 1 "Checking" line per each cmr.json object found....
Checking: rssmif17d__7/test/f17_ssmis_20210708v7.nc.cmr.json
Uploaded rssmif17d__7/f17_ssmis_20200801v7.nc.cmr.json
Uploaded rssmif17d__7/f17_ssmis_20200802v7.nc.cmr.json
Uploaded rssmif17d__7/f17_ssmis_20200803v7.nc.cmr.json
Uploaded rssmif17d__7/f17_ssmis_20200804v7.nc.cmr.json
....etc, etc, 1 "Uploaded blah.cmr.json" line per each new cmr.json object written....
Uploaded rssmif17d__7/f17_ssmis_20210517v7.nc.cmr.json
Deleting: {'Key': 'rssmif17d__7/f17_ssmis_20200801v7.nc.cmr.xml'}
Deleting: {'Key': 'rssmif17d__7/f17_ssmis_20200802v7.nc.cmr.xml'}
Deleting: {'Key': 'rssmif17d__7/f17_ssmis_20200803v7.nc.cmr.xml'}
Deleting: {'Key': 'rssmif17d__7/f17_ssmis_20200804v7.nc.cmr.xml'}
....etc, etc, 1 "Deleting key blah cmr.xml" line per each cmr.xml object removed....
$ $?
-bash: 0: command not found
As explained above, the software created a local csv
file when run, containing the names of each new cmr.json
object created and its size, 1 per line.
Note that we have created/written 275 cmr.json
objects, reflected by 275 items listed in this file, and this matches the initial count (275) of cmr.xml
objects that were present in the S3 locations3://ghrcwuat-public/rssmif17d__7
. This count is a useful check for determining whether the results are as-expected:
$ ls -lh rssmif17d__7.csv
-rw-r--r-- 1 sflynn staff 11K May 4 12:33 rssmif17d__7.csv
$ wc -l rssmif17d__7.csv
275 rssmif17d__7.csv
$ head rssmif17d__7.csv
f17_ssmis_20200801v7.nc.cmr.json,1.475
f17_ssmis_20200802v7.nc.cmr.json,1.475
f17_ssmis_20200803v7.nc.cmr.json,1.475
f17_ssmis_20200804v7.nc.cmr.json,1.474
f17_ssmis_20200805v7.nc.cmr.json,1.475
f17_ssmis_20200806v7.nc.cmr.json,1.475
f17_ssmis_20200807v7.nc.cmr.json,1.475
f17_ssmis_20200808v7.nc.cmr.json,1.475
f17_ssmis_20200809v7.nc.cmr.json,1.475
f17_ssmis_20200810v7.nc.cmr.json,1.475
$ tail rssmif17d__7.csv
f17_ssmis_20210508v7.nc.cmr.json,1.504
f17_ssmis_20210509v7.nc.cmr.json,1.505
f17_ssmis_20210510v7.nc.cmr.json,1.505
f17_ssmis_20210511v7.nc.cmr.json,1.505
f17_ssmis_20210512v7.nc.cmr.json,1.504
f17_ssmis_20210513v7.nc.cmr.json,1.505
f17_ssmis_20210514v7.nc.cmr.json,1.505
f17_ssmis_20210515v7.nc.cmr.json,1.505
f17_ssmis_20210516v7.nc.cmr.json,1.505
f17_ssmis_20210517v7.nc.cmr.json,1.504
Recall that we initially found 275 cmr.xml
objects and 117 cmr.json
objects at s3://ghrcwuat-public/rssmif17d__7
, for a total of 392 cmr.*
objects.
We saw above how output file rssmif17d__7.csv
contains 275 lines corresponding to 275 newly-created cmr.json
objects.
Checking S3 to verify these 275 new cmr.json
objects have been created and the 117 old cmr.xml
objects have been removed is as easy as:
$ aws s3 ls ghrcwuat-public/rssmif17d__7/ --profile ghrcwuat | grep xml | wc -l
0
$ aws s3 ls ghrcwuat-public/rssmif17d__7/ --profile ghrcwuat | grep json | wc -l
392
You can view the last-modified dates of the new cmr.json
files and verify that they're non-empty by listing them. Example:
$ aws s3 ls ghrcwuat-public/rssmif17d__7/ --profile ghrcwuat | head
2022-05-04 12:30:18 1475 f17_ssmis_20200801v7.nc.cmr.json
2022-05-04 12:30:18 1475 f17_ssmis_20200802v7.nc.cmr.json
2022-05-04 12:30:19 1475 f17_ssmis_20200803v7.nc.cmr.json
2022-05-04 12:30:19 1474 f17_ssmis_20200804v7.nc.cmr.json
2022-05-04 12:30:20 1475 f17_ssmis_20200805v7.nc.cmr.json
2022-05-04 12:30:21 1475 f17_ssmis_20200806v7.nc.cmr.json
2022-05-04 12:30:21 1475 f17_ssmis_20200807v7.nc.cmr.json
2022-05-04 12:30:22 1475 f17_ssmis_20200808v7.nc.cmr.json
2022-05-04 12:30:23 1475 f17_ssmis_20200809v7.nc.cmr.json