Set up a Google Cloud Run job to validate unreleased data files. The job is scheduled to run daily at 0400 ET.

Results from the validation scripts are stored in the `htan-dcc.data_release` Google BigQuery dataset. These tables serve as the foundation for generating the final list of releasable HTAN portal files (see: SOP: Data Release Prep).
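Once the job has run, results can be queried directly from BigQuery. A minimal example using the `bq` CLI, inspecting the error output table described under the validation checks below:

```sh
# Inspect recent validation failures
bq query --use_legacy_sql=false \
  'SELECT * FROM `htan-dcc.data_release.errors` LIMIT 10'
```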
Requires access to deploy resources in the HTAN Google Cloud Project, `htan-dcc`. Please contact an owner of `htan-dcc` to request access (owners in 2024: Clarisse Lau, Vesteinn Thorsson, William Longabaugh; ISB).
- Create a Synapse Auth Token secret in Secret Manager. Requires download access to all individual HTAN-center Synapse projects; currently uses the `synapse-service-HTAN-lambda` service account. A sketch of creating the secret follows this list.
- Install Terraform >= 1.7.0.
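A minimal sketch of creating the token secret with the gcloud CLI. The secret name `synapse-auth-token` is an assumption; use whatever name the Terraform configuration expects:

```sh
# Store a Synapse personal access token in Secret Manager
echo -n "$SYNAPSE_AUTH_TOKEN" | gcloud secrets create synapse-auth-token \
  --replication-policy=automatic \
  --data-file=-
```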
Before creating the job, build and push a Docker image to Google Artifact Registry (recommended):

```sh
cd src
docker build . -t us-docker.pkg.dev/<gc-project>/gcr.io/<image-name>
docker push us-docker.pkg.dev/<gc-project>/gcr.io/<image-name>
```
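If the push fails with an authentication error, Docker credentials for Artifact Registry can be configured first:

```sh
gcloud auth configure-docker us-docker.pkg.dev
```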
Define variables in `terraform.tfvars`; variable descriptions can be found in `variables.tf` (an illustrative sketch follows the commands below). Then deploy:

```sh
terraform init
terraform plan
terraform apply
```
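A hypothetical `terraform.tfvars`, assuming variable names such as `project_id` and `schedule`; the authoritative names and descriptions are in `variables.tf`:

```hcl
# Illustrative values only; match the variable names defined in variables.tf
project_id = "htan-dcc"
region     = "us-central1"
image      = "us-docker.pkg.dev/htan-dcc/gcr.io/release-validation"
schedule   = "0 4 * * *"          # daily at 0400
time_zone  = "America/New_York"   # ET, matching the schedule above
```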
The release validation scripts implement a number of file-level checks (a sketch of two such checks follows this list), including:

- Unique `HTAN Data File ID`
- Unique `HTAN Biospecimen ID`
- Unique `HTAN Participant ID` within demographics manifest(s)
- Compliance of `HTAN Data File ID` with the HTAN ID SOP format
- Existence of the Synapse ID provided in Synapse metadata
- Existence of each listed Adjacent Biospecimen ID as a biospecimen entity
- Presence of Parent IDs
- Presence of the minimum `DependsOn` attributes in each metadata manifest
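A minimal sketch of the uniqueness and ID-format checks, assuming manifests are loaded as pandas DataFrames; the column name and the regular expression are illustrative, and the HTAN ID SOP is the authority on the actual format:

```python
import pandas as pd

# Assumed ID pattern for illustration (e.g. HTA1_1_1); see the HTAN ID SOP
HTAN_DATA_FILE_ID_PATTERN = r"^HTA\d+_\d+_\d+$"

def failing_rows(manifest: pd.DataFrame) -> pd.DataFrame:
    """Return manifest rows failing the uniqueness or ID-format checks."""
    ids = manifest["HTAN Data File ID"]
    duplicated = ids.duplicated(keep=False)                # non-unique IDs
    malformed = ~ids.str.match(HTAN_DATA_FILE_ID_PATTERN)  # format violations
    return manifest[duplicated | malformed]
```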
Files failing any of the above checks are added to the error output table: `htan-dcc.data_release.errors`.

Additional checks include (a sketch of the first follows this list):

- Uniqueness of base filenames
- Equivalence of a file's Synapse name, alias, and bucket basename
- Identification of non-data-model columns added to a manifest
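A similar sketch of the base-filename uniqueness check; the `Filename` column holding full bucket paths is an assumption about the manifest layout:

```python
import os
import pandas as pd

def duplicate_basenames(manifest: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose base filename collides with another file's."""
    basenames = manifest["Filename"].map(os.path.basename)
    return manifest[basenames.duplicated(keep=False)]
```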