transform: A Python repository from Cancer Data Aggregator

This repo contains all the code currently used for CDA ETL flows from CRDC DCs into the central CDA metadatabase.

This is beta code: all of it is under active development, but we have reached a point where it can stably generate CDA data releases.

Planned code-level updates include:

investigate all explicit DC-level cross-references (validate, compare to auto-detected identities, report coverage, integrate where needed and safe to do so):
- PDC Reference entity
- PDC Sample.gdc_sample_id
- PDC Sample.gdc_project_id
- IDC tcga_clinical_rel9.case_gdc_id
- IDC tcga_biospecimen_rel9.sample_gdc_id
- IDC tcga_biospecimen_rel9.sample_barcode
- IDC tcga_biospecimen_rel9.case_gdc_id
merge CDA Specimen records across DCs using cross-references
update loader object
- make pre- and post-INSERT SQL scripts to handle index management around main ingest
cache more identifiers
- IDC dicom_all.crdc_instance_uuid
- all from list of cross-ref fields above
- check for others
build architecture for ingest audit trails
- encapsulate into a single phase within each DC flow
- includes "as stored" and "as indexed" data exposed to user
build CDA release metadata table
- version metadata
- source-field provenance metadata
- precomputed count stats for release data
add GDC index files as regular file records
collect GDC transformation-phase code into a Python object
build mutation processing into ETL
collect all PDC ETL flow code into Python objects
create ingest flow for CDS
synchronize flow deployment with CDA's migration to an RDBMS instance from BigQuery
port entire system to cloud operations using Apache Airflow
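
As an illustration of the planned cross-reference check (validate, compare to auto-detected identities, report coverage), here is a minimal sketch. It is not code from this repo: the function name `crossref_coverage`, the pair layout, and the identifiers are all hypothetical, and it assumes both sources can be reduced to sets of (PDC sample ID, GDC sample ID) pairs.

```python
from typing import Dict, Set, Tuple

# Hypothetical shape: (PDC sample ID, GDC sample ID).
Pair = Tuple[str, str]


def crossref_coverage(explicit: Set[Pair], detected: Set[Pair]) -> Dict[str, object]:
    """Compare explicit DC-level cross-references with auto-detected identities."""
    confirmed = explicit & detected      # both sources agree
    explicit_only = explicit - detected  # stated by the DC, not auto-detected
    detected_only = detected - explicit  # auto-detected, not stated by the DC
    # Fraction of explicit cross-references that auto-detection also found.
    coverage = len(confirmed) / len(explicit) if explicit else 0.0
    return {
        "confirmed": confirmed,
        "explicit_only": explicit_only,
        "detected_only": detected_only,
        "coverage": coverage,
    }


# Made-up identifiers, for illustration only.
explicit_pairs = {("pdc-s1", "gdc-a"), ("pdc-s2", "gdc-b"), ("pdc-s3", "gdc-c")}
detected_pairs = {("pdc-s1", "gdc-a"), ("pdc-s2", "gdc-b"), ("pdc-s4", "gdc-d")}

report = crossref_coverage(explicit_pairs, detected_pairs)
print(f"coverage: {report['coverage']:.2f}")
```

Pairs in `explicit_only` or `detected_only` are the ones that would need review before any integration; only `confirmed` pairs are supported by both sources.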
