This repo contains all the code currently used for CDA ETL flows from CRDC DCs into the central CDA metadatabase.
This is beta code: all of it is under active development, but we have reached a point where it can stably generate CDA data releases.
Planned code-level updates include:
- investigate all explicit DC-level cross-references; validate, compare to auto-detected identities, report coverage, integrate as/where needed and safe to do so:
    - PDC Reference entity
    - PDC Sample.gdc_sample_id
    - PDC Sample.gdc_project_id
    - IDC tcga_clinical_rel9.case_gdc_id
    - IDC tcga_biospecimen_rel9.sample_gdc_id
    - IDC tcga_biospecimen_rel9.sample_barcode
    - IDC tcga_biospecimen_rel9.case_gdc_id
- merge CDA Specimen records across DCs using crossrefs
- update loader object
    - make pre- and post-INSERT SQL scripts to handle index management around main ingest
- cache more identifiers
    - IDC dicom_all.crdc_instance_uuid
    - all from list of cross-ref fields above
    - check for others
- build architecture for ingest audit trails
    - encapsulate into a single phase within each DC flow
    - includes "as stored" and "as indexed" data exposed to user
- build CDA release metadata table
    - version metadata
    - source-field provenance metadata
    - precomputed count stats for release data
- add GDC index files as regular file records
- collect GDC transformation-phase code into a Python object
- build mutation processing into ETL
- collect all PDC ETL flow code into Python objects
- create ingest flow for CDS
- synchronize flow deployment with CDA's migration to an RDBMS instance from BigQuery
- port entire system to cloud operations using Airflow
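
The cross-reference item at the top of the list (validate, compare to auto-detected identities, report coverage) could be approached roughly as in the sketch below. This is only an illustration and is not taken from this repo: the `compare_crossrefs` function, the `CrossrefReport` type, and the identifiers in the usage example are all hypothetical, and it assumes both sources can be reduced to simple pairs of IDs.

```python
from typing import Iterable, NamedTuple, Set, Tuple

# A cross-reference is modeled here as a (source_dc_id, gdc_id) pair.
# This representation is an assumption made for the sketch.
IdPair = Tuple[str, str]


class CrossrefReport(NamedTuple):
    confirmed: Set[IdPair]      # pairs present in both sources
    explicit_only: Set[IdPair]  # asserted by the DC, not auto-detected
    auto_only: Set[IdPair]      # auto-detected, not asserted by the DC
    coverage: float             # fraction of explicit pairs also auto-detected


def compare_crossrefs(
    explicit: Iterable[IdPair], auto_detected: Iterable[IdPair]
) -> CrossrefReport:
    """Compare explicit DC-level cross-references to auto-detected identities."""
    explicit_set = set(explicit)
    auto_set = set(auto_detected)
    confirmed = explicit_set & auto_set
    # Avoid dividing by zero when a DC supplies no explicit cross-references.
    coverage = len(confirmed) / len(explicit_set) if explicit_set else 0.0
    return CrossrefReport(
        confirmed=confirmed,
        explicit_only=explicit_set - auto_set,
        auto_only=auto_set - explicit_set,
        coverage=coverage,
    )


# Usage with made-up identifiers (not real PDC or GDC IDs).
explicit = [("pdc-sample-1", "gdc-sample-a"), ("pdc-sample-2", "gdc-sample-b")]
auto = [("pdc-sample-1", "gdc-sample-a"), ("pdc-sample-3", "gdc-sample-c")]
report = compare_crossrefs(explicit, auto)
print(f"coverage: {report.coverage:.0%}")
```

Pairs in `explicit_only` and `auto_only` are the ones that would need review before any cross-reference is integrated into a merge step.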