/transform

Python code for implementing transforms on data extracted from DCs

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

This repo contains all the code currently used for CDA ETL flows from CRDC DCs into the central CDA metadatabase.

This is beta code: all of it is under active development, but we have reached a point where it can stably generate CDA data releases.

Planned code-level updates include:

  • investigate all explicit DC-level cross-references (validate, compare to auto-detected identities, report coverage, and integrate where needed and safe to do so):
    • PDC Reference entity
    • PDC Sample.gdc_sample_id
    • PDC Sample.gdc_project_id
    • IDC tcga_clinical_rel9.case_gdc_id
    • IDC tcga_biospecimen_rel9.sample_gdc_id
    • IDC tcga_biospecimen_rel9.sample_barcode
    • IDC tcga_biospecimen_rel9.case_gdc_id
  • merge CDA Specimen records across DCs using crossrefs
  • update loader object
    • make pre- and post-INSERT SQL scripts to handle index management around main ingest
  • cache more identifiers
    • IDC dicom_all.crdc_instance_uuid
    • all from list of cross-ref fields above
    • check for others
  • build architecture for ingest audit trails
    • encapsulate into a single phase within each DC flow
    • includes "as stored" and "as indexed" data exposed to user
  • build CDA release metadata table
    • version metadata
    • source-field provenance metadata
    • precomputed count stats for release data
  • add GDC index files as regular file records
  • collect GDC transformation-phase code into a Python object
  • build mutation processing into ETL
  • collect all PDC ETL flow code into Python objects
  • create ingest flow for CDS
  • synchronize flow deployment with CDA's migration to an RDBMS instance from BigQuery
  • port entire system to cloud operations using Apache Airflow
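
As an illustration of the first planned item (validating explicit cross-references against auto-detected identities and reporting coverage), here is a minimal sketch. It is not code from this repo: the names `CrossrefReport` and `compare_crossrefs`, the pair-of-IDs representation, and the sample identifiers are all hypothetical, and the real comparison may need to handle one-to-many mappings and ID normalization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrossrefReport:
    """Hypothetical summary of explicit vs. auto-detected identity pairs."""
    confirmed: frozenset      # pairs present in both sources
    explicit_only: frozenset  # asserted by a DC but not auto-detected
    detected_only: frozenset  # auto-detected but not asserted by a DC

    @property
    def coverage(self) -> float:
        """Fraction of explicit cross-references that auto-detection also found."""
        total = len(self.confirmed) + len(self.explicit_only)
        return len(self.confirmed) / total if total else 0.0


def compare_crossrefs(explicit, detected) -> CrossrefReport:
    """Compare two iterables of (source_id, target_id) pairs.

    Assumption: each cross-reference can be reduced to a hashable pair of IDs.
    """
    explicit_pairs = frozenset(explicit)
    detected_pairs = frozenset(detected)
    return CrossrefReport(
        confirmed=explicit_pairs & detected_pairs,
        explicit_only=explicit_pairs - detected_pairs,
        detected_only=detected_pairs - explicit_pairs,
    )


# Made-up identifiers, for illustration only.
explicit = [
    ("pdc-sample-1", "gdc-sample-a"),
    ("pdc-sample-2", "gdc-sample-b"),
    ("pdc-sample-3", "gdc-sample-c"),
]
detected = [
    ("pdc-sample-1", "gdc-sample-a"),
    ("pdc-sample-2", "gdc-sample-x"),
    ("pdc-sample-4", "gdc-sample-d"),
]

report = compare_crossrefs(explicit, detected)
print(len(report.confirmed), len(report.explicit_only), len(report.detected_only))
print(f"coverage: {report.coverage:.2f}")
```

Keeping the three sets separate, rather than reporting only a single coverage number, leaves room to review conflicting pairs by hand before any of them are integrated.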