Pipeline Configuration Improvements
mmcdermott opened this issue · 1 comment
mmcdermott commented
Right now, the pipeline configuration across multiple stages is good overall, but it has some non-trivial problems:
- Each stage can only be a data stage or a metadata stage, not both, because of how output directories work. To fix this, all stages should store data outputs in `$output_dir/data` and metadata outputs in `$output_dir/metadata`, like the overall MEDS directory.
- Stages that end up doing nothing (e.g., extracting metadata when there is no metadata block, as in #154) will yield empty directories that will confuse subsequent stages. Instead, subsequent stages should (somehow) know to look backwards through prior stages to find their input when output directories are empty or not properly constructed, maybe? Or empty stages should just symlink their inputs to their outputs? It is unclear.
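
To make the "look backwards" option concrete, here is a minimal sketch, assuming every stage writes to its own directory containing `data` and `metadata` subdirectories as proposed above. The names `has_outputs` and `resolve_input_dir` are hypothetical and are not part of the existing pipeline code; this is one possible shape for the lookup, not a settled design.

```python
from pathlib import Path
from typing import Optional, Sequence


def has_outputs(directory: Path) -> bool:
    """Return True if `directory` exists and holds at least one file at any depth."""
    return directory.is_dir() and any(p.is_file() for p in directory.rglob("*"))


def resolve_input_dir(prior_stage_dirs: Sequence[Path], kind: str) -> Optional[Path]:
    """Walk backwards through prior stages to find the latest non-empty `kind` output.

    `prior_stage_dirs` is assumed to be ordered from earliest to latest stage, and
    `kind` is either "data" or "metadata" (the proposed per-stage subdirectories).
    Returns None if no prior stage produced any output of that kind.
    """
    if kind not in ("data", "metadata"):
        raise ValueError(f"kind must be 'data' or 'metadata', got {kind!r}")

    # Latest stage first: a stage that did nothing is skipped because its
    # subdirectory is either missing or contains no files.
    for stage_dir in reversed(prior_stage_dirs):
        candidate = stage_dir / kind
        if has_outputs(candidate):
            return candidate
    return None
```

With this approach, a stage that did nothing needs no special handling on its own side; the cost is that every consuming stage must know the full ordered list of prior stage directories. The symlink alternative would instead keep consumers simple by making each stage's output directory always valid.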
mmcdermott commented
Tagging @Oufattole for tracking