data-release-cloud-run

Set up a Google Cloud Run job to validate unreleased data files.

The job is scheduled to run daily at 04:00 ET.

Results from the validation scripts are stored in the htan-dcc.data_release Google BigQuery dataset. These tables serve as the foundation for generating the final list of releasable HTAN portal files.

Requirements

Deploying these resources requires access to the HTAN Google Cloud Project, htan-dcc. Contact an owner of htan-dcc to request access (owners in 2024: Clarisse Lau, Vesteinn Thorsson, William Longabaugh, ISB).

Prerequisites

Create a Synapse Auth Token secret in Secret Manager. The token requires download access to all individual HTAN-center Synapse projects. The job currently uses the synapse-service-HTAN-lambda service account.
Install Terraform >= 1.7.0.

Docker Image

Before creating the job, build and push a Docker image to Google Artifact Registry (recommended):

cd src
docker build . -t us-docker.pkg.dev/<gc-project>/gcr.io/<image-name>
docker push us-docker.pkg.dev/<gc-project>/gcr.io/<image-name>

Deploy Cloud Resources

Define variables in terraform.tfvars (variable descriptions can be found in variables.tf), then run:

terraform init
terraform plan
terraform apply

Validation Checks

The release validation scripts implement several file-level checks, including:

Unique HTAN Data File ID
Unique HTAN Biospecimen ID
Unique HTAN Participant ID within demographics manifest(s)
Compliance of HTAN Data File ID with HTAN ID SOP format
Existence of Synapse ID provided in Synapse metadata
Existence of listed Adjacent Biospecimen ID as biospecimen entity
Presence of Parent IDs
Presence of the minimum DependsOn attributes in the metadata manifest

Files failing any of the above checks are added to the error output table: htan-dcc.data_release.errors
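
As a rough illustration of two of these checks (ID uniqueness and existence of a referenced biospecimen), here is a minimal Python sketch. It is not the repository's implementation: the function names, the sample rows, the placeholder ID values, and the column names used below are all assumptions made for the example, and the placeholder IDs are not meant to follow the HTAN ID SOP format.

```python
from collections import Counter


def find_duplicate_ids(rows, id_column):
    """Return every row whose value in id_column occurs more than once."""
    counts = Counter(row[id_column] for row in rows)
    return [row for row in rows if counts[row[id_column]] > 1]


def find_missing_references(rows, ref_column, known_ids):
    """Return every row whose ref_column value is not among known_ids."""
    known = set(known_ids)
    return [row for row in rows if row[ref_column] not in known]


# Hypothetical manifest rows; column names and ID values are illustrative only.
file_rows = [
    {"entityId": "syn001", "HTAN Data File ID": "FILE_A", "Adjacent Biospecimen ID": "BIOS_1"},
    {"entityId": "syn002", "HTAN Data File ID": "FILE_B", "Adjacent Biospecimen ID": "BIOS_2"},
    {"entityId": "syn003", "HTAN Data File ID": "FILE_A", "Adjacent Biospecimen ID": "BIOS_9"},
]
biospecimen_ids = ["BIOS_1", "BIOS_2"]

duplicate_rows = find_duplicate_ids(file_rows, "HTAN Data File ID")
missing_rows = find_missing_references(
    file_rows, "Adjacent Biospecimen ID", biospecimen_ids
)

# Rows collected here would be candidates for the error output table.
print([row["entityId"] for row in duplicate_rows])
print([row["entityId"] for row in missing_rows])
```

Both helpers return the offending rows rather than a pass/fail flag, so each failure can be recorded per file.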

Additional checks are available for internal use but are not mandatory for release. These include:

Uniqueness of base filenames
Equivalence of a file's Synapse name, alias, and bucket basename
Identification of non-data-model columns added to a manifest
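
The optional checks above could be sketched along the following lines. This is a hedged example under stated assumptions, not the code used by the validation scripts: the function names, sample paths, and column names are hypothetical, and bucket paths are assumed to use forward slashes.

```python
import posixpath
from collections import defaultdict


def duplicate_basenames(paths):
    """Group paths by base filename and keep only names shared by several paths."""
    groups = defaultdict(list)
    for path in paths:
        groups[posixpath.basename(path)].append(path)
    return {name: members for name, members in groups.items() if len(members) > 1}


def names_match(synapse_name, alias, bucket_path):
    """True when the Synapse name, the alias, and the bucket basename are identical."""
    return synapse_name == alias == posixpath.basename(bucket_path)


def extra_columns(manifest_columns, model_columns):
    """Return manifest columns that are not part of the data model, sorted."""
    return sorted(set(manifest_columns) - set(model_columns))


# Hypothetical inputs for illustration.
paths = ["batch1/image.tif", "batch2/image.tif", "batch1/other.tif"]
print(duplicate_basenames(paths))
print(names_match("image.tif", "image.tif", "bucket/batch1/image.tif"))
print(extra_columns(["Filename", "File Format", "Notes"], ["Filename", "File Format"]))
```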