/dmrpp-generator

An Activity to generate DMR++ files from netCDF4 and HDF files

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

 ____  __  __ ____  ____  ____
|  _ \|  \/  |  _ \|  _ \|  _ \
| | | | |\/| | |_) | |_) | |_) |
| |_| | |  | |  _ <|  __/|  __/
|____/|_|  |_|_| \_\_|   |_|

Overview

This repo contains the code for the DMR++ ECS module, the Lambda, and a Python CLI for interacting with the DMR++ container.

Versioning

We follow the v<major>.<minor>.<patch> versioning convention, where:

  • <major>+1 means we changed the infrastructure and/or the major components that make this software run. This will definitely lead to breaking changes.
  • <minor>+1 means we upgraded/patched the dependencies this software relies on. This can lead to breaking changes.
  • <patch>+1 means we fixed a bug and/or added a feature. Breaking changes are not expected.
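
As an illustration of the convention above, here is a minimal sketch that parses two version tags and reports which component changed. The names `parse_version` and `bump_kind` are hypothetical helpers written for this example; they are not part of this repo.

```python
import re

# Matches tags such as "v4.1.2" (the v<major>.<minor>.<patch> convention above).
VERSION_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def parse_version(tag):
    """Return (major, minor, patch) as integers, or raise ValueError."""
    match = VERSION_RE.match(tag)
    if match is None:
        raise ValueError(f"not a v<major>.<minor>.<patch> tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def bump_kind(old_tag, new_tag):
    """Name the most significant component that differs between two tags."""
    old, new = parse_version(old_tag), parse_version(new_tag)
    for name, before, after in zip(("major", "minor", "patch"), old, new):
        if before != after:
            return name
    return "none"
```

A "major" or "minor" result suggests reading the release notes for breaking changes before upgrading.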

Prerequisites

The prerequisites depend on which use case is needed.

Terraform Module

This module is meant to be used within the Cumulus stack. If you don't have a Cumulus stack deployed yet, please consult this repo and follow the documentation to provision it.

DMR++ Python CLI

For each release after v4.1.0, a Python wheel is published in the release assets. It can be installed and used locally via pip, for example:

pip install https://github.com/ghrcdaac/dmrpp-generator/releases/download/v1.0.0-test/dmrpp_file_generator-4.1.2-py3-none-any.whl

The Python module uses Docker Compose to generate dmrpp files locally, so no other dependencies should be needed.

Deploying the Terraform module with the Cumulus Stack

In your main.tf file (where you defined the cumulus module), add:

module "dmrpp-generator" {
 // Required Parameters
 source = "https://github.com/ghrcdaac/dmrpp-generator/releases/download/<tag_num>/dmrpp-generator.zip"
 cluster_arn = module.cumulus.ecs_cluster_arn
 region = var.region
 prefix = var.prefix
 
 // Optional Activity Parameters
 docker_image = "ghrcdaac/dmrpp-generator:<tag_num>" // default to the correct release
 cpu = 800 // default to 800
 enable_cw_logging = false // default to false
 memory_reservation = 900 // default to 900
 // prefix (the Cumulus stack prefix) is already set above under Required Parameters
 desired_count = 1  // Default to 1
 log_destination_arn = var.aws_log_mechanism // default to null
 
 // Optional Lambda Specific Configuration  
 cumulus_lambda_role_arn = module.cumulus.lambda_processing_role_arn // If provided the lambda will be provisioned
 timeout = 900
 memory_size = 256
 ephemeral_storage = 512
}

Note: When the lambda is provisioned, the module creates a private ECR repository for the dmrpp_generator container. The first deployment can take more than 5 minutes because the image needs to be pulled from Docker Hub and pushed to this new ECR repository. This is a temporary workaround until a public ECR repository can be created for the lambda image.

Outputs

The module returns the service id and the lambda ARN:

output "dmrpp_task_id" {
  value = module.dmrpp_service.dmrpp_task_id
}

output "dmrpp_lambda_arn" {
  value = module.dmrpp_lambda.dmrpp_lambda_arn
}

In your variables.tf file, define:

variable "dmrpp-generator-docker-image" {
  default = "ghrcdaac/dmrpp-generator:<tag_num>"
}

This assumes you have already defined the region and the prefix.

Add the activity/lambda to your workflow

In your workflow.tf add

   "HyraxProcessing": {
      "Parameters": {
        "cma": {
          "event.$": "$",
          "task_config": {
            "buckets": "{$.meta.buckets}",
            "distribution_endpoint": "{$.meta.distribution_endpoint}",
            "files_config": "{$.meta.collection.files}",
            "fileStagingDir": "{$.meta.collection.url_path}",
            "granuleIdExtraction": "{$.meta.collection.granuleIdExtraction}",
            "collection": "{$.meta.collection}"
          }
        }
      },
      "Type": "Task",
      "Resource": "${module.dmrpp-generator.dmrpp_task_id}",
      "Catch": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "ResultPath": "$.exception",
          "Next": "WorkflowFailed"
        }
      ],
      "Retry": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "IntervalSeconds": 2,
          "MaxAttempts": 3
        }
      ],
      "Next": "<Your next Step>"
    }

Where <Your next Step> is the next step in your workflow.

Cumulus Collection Configuration

Add the options desired to the collection definition as follows:

{
  "config": {
    "meta": {
      "dmrpp": {
        "options": [
          {
            "flag": "-M"
          },
          {
            "flag": "-s",
            "opt": "s3://ghrcsbxw-public/dmrpp_config/file.config",
            "download": "true"
          },
          {
            "flag": "-c",
            "opt": "s3://ghrcsbxw-public/aces1cont__1/aces1cont_2002.212_v2.50.tar.cmr.json",
            "download": "false"
          }
        ]
      }
    }
  }
}

For a list of all configuration options see: https://docs.opendap.org/index.php?title=DMR%2B%2B#:~:text=4.2%20Command%20line%20options
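
To make the shape of the options list concrete, here is a rough sketch of how such a list could be flattened into command-line arguments. It assumes each entry becomes its flag followed by its opt, if one is present. The function name `options_to_args` is hypothetical, and the download key is deliberately ignored because this README does not describe how the generator handles it.

```python
def options_to_args(options):
    """Flatten a list of {"flag", "opt"} dicts into a flat argument list."""
    args = []
    for option in options:
        args.append(option["flag"])
        # "opt" is absent for plain switches such as -M.
        if "opt" in option:
            args.append(option["opt"])
    return args


options = [
    {"flag": "-M"},
    {"flag": "-s", "opt": "s3://ghrcsbxw-public/dmrpp_config/file.config", "download": "true"},
]
args = options_to_args(options)
```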

Cumulus Workflow Configuration

If your workflow is used by multiple collections that share a common dmrpp config, the config can be set at the workflow's ${StepName}.Parameters.cma.task_config.dmrpp instead of in each collection. Note: if the workflow and the collection both have a dmrpp key, the two configurations are merged, and the collection's config overrides any keys found in both. For example:

# terraform

dmrpp_config = {
  options = [
    {
      flag = "-M"
    },
    {
      flag = "-s"
      opt = "s3://ghrcsbxw-public/dmrpp_config/file.config"
      download = "true"
    },
    {
      flag = "-c"
      opt = "s3://ghrcsbxw-public/aces1cont__1/aces1cont_2002.212_v2.50.tar.cmr.json"
      download = "false"
    }
  ]
}

# workflow JSON
   "HyraxProcessing": {
      "Parameters": {
        "cma": {
          "event.$": "$",
          "task_config": {
            ...
            "dmrpp": ${jsonencode(dmrpp_config)}
          }
        }
      },

    ...
    }
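
The merge rule between workflow and collection config can be pictured with the sketch below. It assumes a shallow, key-by-key merge, which is a guess at the behavior rather than a description of the actual implementation. `merge_dmrpp_config` is a hypothetical name.

```python
def merge_dmrpp_config(workflow_config, collection_config):
    """Shallow merge where collection keys override workflow keys."""
    merged = dict(workflow_config)
    # Assumption: keys are replaced wholesale, not merged recursively.
    merged.update(collection_config)
    return merged


workflow_config = {"options": [{"flag": "-M"}], "get_dmrpp_timeout": 60}
collection_config = {"get_dmrpp_timeout": 120}
merged = merge_dmrpp_config(workflow_config, collection_config)
```

Here the collection's get_dmrpp_timeout wins, while options is inherited from the workflow because the collection does not define it.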

Timeout Configuration

The subprocess call to the BESD library has a configurable timeout value. It will default to 60 seconds if not configured. There are two ways to provide a custom value.

  1. Setting the get_dmrpp_timeout terraform variable
  2. Adding get_dmrpp_timeout to the collection definition: collection.meta.dmrpp

If the value is provided in the collection definition, it takes precedence over the value set through terraform.
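
The precedence described above could look roughly like the following sketch. It assumes the terraform variable reaches the code as an environment variable named GET_DMRPP_TIMEOUT; that name, and the function names, are assumptions made for this example only.

```python
import os
import subprocess

DEFAULT_TIMEOUT_SECONDS = 60  # default stated above


def resolve_timeout(dmrpp_meta, environ=None):
    """Collection value first, then environment, then the 60 second default."""
    environ = os.environ if environ is None else environ
    if "get_dmrpp_timeout" in dmrpp_meta:
        return int(dmrpp_meta["get_dmrpp_timeout"])
    # Assumption: the terraform variable surfaces as this environment variable.
    return int(environ.get("GET_DMRPP_TIMEOUT", DEFAULT_TIMEOUT_SECONDS))


def run_with_timeout(command, timeout):
    """Run a command; subprocess.TimeoutExpired is raised if it overruns."""
    return subprocess.run(command, timeout=timeout, check=True)
```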

Subprocess Logging Configuration

When making the subprocess call, stdout and stderr default to None to avoid an issue where the timeout is not respected. This can be configured in two ways.

  1. Setting the ENABLE_SUBPROCESS_LOGGING environment variable in terraform
  2. Adding enable_subprocess_logging to the collection definition: collection.meta.dmrpp.

If the value is provided in the collection definition, it takes precedence over the environment variable. The value can be true or false.
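
A sketch of how a true/false setting like this might be resolved and then applied to the subprocess call. The helper names are hypothetical, and routing output through a pipe when logging is enabled is an assumption; the README only states that the default is None.

```python
import os
import subprocess


def resolve_flag(name, dmrpp_meta, environ=None, default=False):
    """Resolve a true/false setting: collection value, then env var, then default."""
    environ = os.environ if environ is None else environ
    if name in dmrpp_meta:
        raw = dmrpp_meta[name]
    else:
        raw = environ.get(name.upper(), default)
    # Accept real booleans as well as the strings "true"/"false".
    return str(raw).strip().lower() == "true"


def stream_targets(logging_enabled):
    """Return (stdout, stderr) arguments for subprocess.run."""
    if logging_enabled:
        # Assumption: capture both streams together when logging is on.
        return subprocess.PIPE, subprocess.STDOUT
    return None, None  # the default described above
```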

Verify Output Configuration

The processing code verifies that dmrpp outputs are produced and raises an exception if they are not. This behavior can be disabled if needed. This can be configured in two ways.

  1. Setting the VERIFY_OUTPUT environment variable in terraform
  2. Adding verify_output to the collection definition: collection.meta.dmrpp.

If the value is provided in the collection definition, it takes precedence over the environment variable. The value can be true or false.
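
The output check can be thought of along the lines of the sketch below. It assumes the output for data.nc is written next to it as data.nc.dmrpp. The exception class and function name are hypothetical and only illustrate the idea.

```python
from pathlib import Path


class MissingDmrppOutput(Exception):
    """Hypothetical exception type for this sketch."""


def verify_outputs(input_files, verify_output=True):
    """Return the expected .dmrpp paths, raising if any are absent."""
    # Assumption: the output for data.nc is written next to it as data.nc.dmrpp.
    expected = [Path(str(path) + ".dmrpp") for path in input_files]
    missing = [path for path in expected if not path.exists()]
    if verify_output and missing:
        raise MissingDmrppOutput(f"no dmrpp output for: {missing}")
    return expected
```

With verify_output set to False, the function returns the expected paths without raising, which mirrors disabling the check.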

DMR++ Python CLI

How to install

Find the version you want to use, get the asset URL for its .whl file, and install it as in the following example command:

pip install https://github.com/ghrcdaac/dmrpp-generator/releases/download/v<release_version>/dmrpp_file_generator-<dmrpp_version>-py3-none-any.whl

Supported get_dmrpp configuration

Via env vars

Create a PAYLOAD environment variable holding dmrpp options

PAYLOAD='{"dmrpp_regex": "^.*.nc4", "options":[{"flag": "-M"}, {"flag": "-s", "opt": "s3://ghrcsbxw-public/dmrpp_config/file.config","download": "true"}]}'

dmrpp_regex is optional; it overrides the default DMRPP-Generator regex.
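
Hand-writing JSON inside shell quotes is error-prone, so one option is to build the payload in Python. The sketch below reuses the values from the example above. The file names are made up, and checking the regex with re.match is an assumption about how the generator applies it.

```python
import json
import os
import re

payload = {
    "dmrpp_regex": "^.*.nc4",
    "options": [
        {"flag": "-M"},
        {"flag": "-s", "opt": "s3://ghrcsbxw-public/dmrpp_config/file.config", "download": "true"},
    ],
}

# json.dumps guarantees valid JSON quoting, which is easy to get wrong by hand.
os.environ["PAYLOAD"] = json.dumps(payload)

# Quick local check of which (made-up) file names the regex would select.
names = ["granule.nc4", "granule.txt"]
matched = [name for name in names if re.match(payload["dmrpp_regex"], name)]
```

Note that os.environ only affects the current Python process and any child processes it starts, not the parent shell.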

Generate DMRpp files locally without Hyrax server

Note: dmrpp now uses Docker Compose v2. Please update to Docker Compose v2, or you will get the error:
/bin/sh: 1: docker compose: not found

Overview:

$ dmrpp -h
usage: dmrpp [-h] -p NC_HDF_PATH [-prt PORT] [-pyld PAYLOAD] [--validate] [--no-validate]

Generate and validate DMRPP files. Any DMR++ commandline option can be provided in addition to the options listed below. To see what options are available check the documentation:
https://docs.opendap.org/index.php?title=DMR%2B%2B#Command_line_options

optional arguments:
  -h, --help            show this help message and exit
  -p NC_HDF_PATH, --path NC_HDF_PATH
                        Path to netCDF4 and HDF5 folder
  -prt PORT, --port PORT
                        Port number to Hyrax local server
  -pyld PAYLOAD, --payload PAYLOAD
                        Payload to pass to the besd get_dmrpp call. If not set, will check for PAYLOAD environment variable, or default to '{}'
  --validate            Validate netCDF4 and HDF5 files against OPeNDAP local server. This is the default behavior
  --no-validate         Do not validate netCDF4 and HDF5 files against OPeNDAP local server. The default behavior is --validate.

The folder passed to --path (an absolute path) should contain netCDF and/or HDF files.

$ dmrpp --path /path/to/inputs/ --no-validate

Generate DMRpp files locally with Hyrax server (for validation)

$ dmrpp --path /path/to/inputs/ --validate
Log file: /tmp/dmrpp-generator-13z6cizs
Results served at : http://localhost:8080/opendap (^C to kill the server)
^C
Shutting down the server...

A prompt will ask you to visit localhost:8080. To change the default port, run the command with the -prt option:

$ dmrpp --path /path/to/inputs/ --validate -prt 8889
Log file: /tmp/dmrpp-generator-34blq6gn
Results served at : http://localhost:8889/opendap (^C to kill the server)
^C
Shutting down the server...
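
When the CLI is driven from a script, the invocations above can be assembled programmatically. This sketch uses only the options listed in the dmrpp -h output. `build_dmrpp_command` is a hypothetical helper, not something shipped with the package.

```python
import shlex


def build_dmrpp_command(path, validate=True, port=None, payload=None):
    """Assemble a dmrpp argument list from the options shown in `dmrpp -h`."""
    command = ["dmrpp", "--path", path]
    command.append("--validate" if validate else "--no-validate")
    if port is not None:
        command += ["--port", str(port)]
    if payload is not None:
        command += ["--payload", payload]
    return command


command = build_dmrpp_command("/path/to/inputs/", validate=True, port=8889)
printable = shlex.join(command)  # shell-quoted form, handy for logging
```

On a machine where the wheel is installed, the resulting list could be passed to subprocess.run(command, check=True).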