refactor intermediate data formats
bryantChhun opened this issue · 0 comments
bryantChhun commented
Issue
Before we can think about enhanced parallelization and PyTorch dataloaders, we need to rethink the data formats for dynamorph.
For each stage of the pipeline, we should explicitly define its inputs and outputs: file format, dimensionality, and file name.
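One way to make these per-stage definitions concrete is to write them down as small, comparable records and check that each stage's output matches the next stage's input. The sketch below is only an illustration: `DataSpec`, `StageSpec`, `find_mismatches`, the stage names, the axis labels, and the file name patterns are all hypothetical and not taken from the dynamorph code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataSpec:
    """Describes one dataset exchanged between stages (hypothetical)."""
    file_format: str        # e.g. "zarr" or "npy"
    dims: Tuple[str, ...]   # axis names, e.g. ("T", "C", "Y", "X")
    name_pattern: str       # file name template, e.g. "{site}_patches.zarr"


@dataclass(frozen=True)
class StageSpec:
    """Declares what a pipeline stage reads and writes (hypothetical)."""
    name: str
    inputs: DataSpec
    outputs: DataSpec


def find_mismatches(stages: List[StageSpec]) -> List[str]:
    """Return "upstream -> downstream" for every adjacent pair whose specs differ."""
    problems = []
    for upstream, downstream in zip(stages, stages[1:]):
        if upstream.outputs != downstream.inputs:
            problems.append(f"{upstream.name} -> {downstream.name}")
    return problems


# Illustrative specs only; the real stages and formats are still to be decided.
raw = DataSpec("npy", ("T", "C", "Y", "X"), "{site}.npy")
patches = DataSpec("zarr", ("N", "C", "Y", "X"), "{site}_patches.zarr")
latent = DataSpec("zarr", ("N", "D"), "{site}_latent.zarr")

pipeline = [
    StageSpec("segmentation", raw, patches),
    StageSpec("encoding", patches, latent),
]

print(find_mismatches(pipeline))
```

Because the specs are frozen dataclasses, two stages agree only if format, dimension order, and file name pattern all match, which is the kind of consistency check that could run before any compute starts.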
Considerations
We primarily need:
- data consistency between each stage
- parallelization
- efficiency (both compute and loading; zarr caching?)
Questions
Can we avoid data duplication? Are there intermediate stages where duplication can be avoided?