mehta-lab/dynamorph

refactor intermediate data formats

bryantChhun opened this issue · 0 comments

Issue

Before we can think about enhanced parallelization and PyTorch dataloaders, we need to rethink the intermediate data formats for dynamorph.

For each stage of the pipeline, we should better define the input and output data types (file format, dimensionality, file naming).
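
One way to make that definition explicit is a small per-stage contract object. The sketch below is hypothetical: `StageSpec`, the axis names, and the example file patterns are illustrative placeholders, not the current dynamorph formats.

```python
# Hypothetical per-stage I/O contract; names and values are illustrative only.
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StageSpec:
    """Declares what a pipeline stage reads or writes."""
    file_format: str                  # e.g. "zarr", "npy", "tiff"
    dimensionality: Tuple[str, ...]   # axis order, e.g. ("T", "C", "Y", "X")
    name_pattern: str                 # e.g. "{site}_raw.zarr"


# Example: a hypothetical segmentation stage
SEGMENTATION_IN = StageSpec("zarr", ("T", "C", "Y", "X"), "{site}_raw.zarr")
SEGMENTATION_OUT = StageSpec("zarr", ("T", "Y", "X"), "{site}_masks.zarr")
```

With specs like these checked at stage boundaries, mismatches in format, shape, or naming surface immediately instead of partway through a run.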

considerations

We primarily need:

  1. data consistency between each stage
  2. parallelization
  3. efficiency, in both compute and loading (zarr caching? see the sketch after this list)
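
A minimal sketch of how chunked zarr intermediates could serve points 2 and 3, assuming a (T, C, Y, X) array layout and zarr-on-disk stores; the class name and paths are assumptions, not existing dynamorph code.

```python
# Sketch: lazily read zarr-backed intermediates inside a PyTorch Dataset,
# so DataLoader workers pull only the chunks they need in parallel.
import zarr
import torch
from torch.utils.data import Dataset, DataLoader


class ZarrFrameDataset(Dataset):
    """Reads single timepoints from a chunked zarr store on demand."""

    def __init__(self, zarr_path: str):
        self.arr = zarr.open(zarr_path, mode="r")  # opens metadata only, no pixel data yet

    def __len__(self):
        return self.arr.shape[0]  # number of timepoints (assumed T-first layout)

    def __getitem__(self, t):
        frame = self.arr[t]  # reads only the chunks covering frame t
        return torch.from_numpy(frame.astype("float32"))


# Usage (hypothetical path): parallel loading comes from DataLoader workers.
# ds = ZarrFrameDataset("site0_patches.zarr")
# loader = DataLoader(ds, batch_size=8, num_workers=4)
```

Because each `__getitem__` touches only the chunks it needs, the same store format serves both the caching question and worker-level parallelism without a separate loading layer per stage.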

questions

Can we avoid data duplication? Which intermediate stages could avoid duplicating data, e.g. by referencing earlier outputs instead of copying them? One possibility is sketched below.
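
A hedged sketch of one option: instead of writing cropped patches to disk (duplicating pixels), store only patch coordinates and slice the original zarr array on demand. The function name, the coordinate file, and the (T, C, Y, X) layout are all hypothetical.

```python
# Sketch: derive patches by slicing the raw store rather than copying pixels.
import json
import zarr


def load_patch(raw_zarr_path: str, coords_json: str, index: int):
    """Cut a patch out of the raw array using stored (t, y, x, size) records."""
    arr = zarr.open(raw_zarr_path, mode="r")
    with open(coords_json) as f:
        t, y, x, size = json.load(f)[index]
    # Only the chunks overlapping this window are read from disk;
    # assumes a (T, C, Y, X) axis order.
    return arr[t, :, y:y + size, x:x + size]
```

The trade-off is recomputation at read time versus duplicated storage, so this probably only suits stages whose outputs are cheap, deterministic views of earlier data.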