mmcdermott/MEDS_transforms

Pipeline Configuration Improvements

mmcdermott opened this issue · 1 comment

The multi-stage pipeline configuration works well overall, but it has some non-trivial problems:

  • Each stage can be only a data stage or a metadata stage, not both, because of how output directories work. To fix this, every stage should store data outputs in $output_dir/data and metadata outputs in $output_dir/metadata, mirroring the overall MEDS directory layout.
  • Stages that end up doing nothing (e.g., extracting metadata when there is no metadata block; see #154) yield empty directories that confuse subsequent stages. Instead, subsequent stages could look backwards through prior stages to find their input when an output directory is empty or not properly constructed. Alternatively, empty stages could symlink their inputs to their outputs. It is unclear which approach is better.
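
To make the two proposals concrete, here is a rough sketch of what the per-stage layout and the "look backwards" option might look like. This is not existing MEDS_transforms code: the helper names `stage_dirs` and `resolve_input`, and the idea of passing an ordered list of prior stage output directories, are assumptions made purely for illustration.

```python
from pathlib import Path
from typing import Optional


def stage_dirs(output_dir: Path) -> dict:
    """Hypothetical helper: every stage gets both a data and a metadata subdirectory."""
    return {"data": output_dir / "data", "metadata": output_dir / "metadata"}


def resolve_input(prior_stage_dirs: list, kind: str) -> Optional[Path]:
    """Hypothetical helper: find the input directory for the next stage.

    Walks backwards through the output directories of prior stages (ordered
    first to last) and returns the most recent `kind` subdirectory ("data" or
    "metadata") that exists and is non-empty. Returns None if no prior stage
    produced any output of that kind.
    """
    for stage_dir in reversed(prior_stage_dirs):
        candidate = Path(stage_dir) / kind
        if candidate.is_dir() and any(candidate.iterdir()):
            return candidate
    return None
```

With this approach, a stage that does nothing simply leaves its subdirectories empty, and later stages skip over it. The symlink alternative would keep input resolution trivial (always read from the immediately preceding stage) at the cost of each no-op stage having to create links, so the trade-off raised above remains open.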

Tagging @Oufattole for tracking