asdf-format/asdf

Chunking support

When a large ndarray is stored as a single compressed binary block, the whole block (or at least everything from its beginning up to the requested data) must be read and decompressed, even when only a small subarray is needed. "Chunking" remedies this: instead of storing an ndarray as a single binary block, it is stored as a set of smaller blocks that are compressed and stored independently.
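To make the idea concrete, here is a minimal sketch in plain Python. It is not ASDF code: the helper names `write_chunks` and `read_element`, the tile size `CHUNK`, and the use of `zlib` are all assumptions chosen for illustration. Each tile is compressed on its own, so reading a single element only decompresses the one tile that contains it.

```python
import zlib
from array import array

CHUNK = 4  # hypothetical tile edge length (in elements), chosen for illustration


def write_chunks(rows, chunk=CHUNK):
    """Split a 2-D list of ints into tiles and compress each tile independently."""
    blocks = {}
    for r0 in range(0, len(rows), chunk):
        for c0 in range(0, len(rows[0]), chunk):
            tile = array("q")  # 64-bit signed integers, row-major within the tile
            for row in rows[r0:r0 + chunk]:
                tile.extend(row[c0:c0 + chunk])
            blocks[(r0, c0)] = zlib.compress(tile.tobytes())
    return blocks


def read_element(blocks, shape, r, c, chunk=CHUNK):
    """Return element (r, c) by decompressing only the tile that contains it."""
    r0, c0 = r - r % chunk, c - c % chunk  # offset of the containing tile
    tile = array("q")
    tile.frombytes(zlib.decompress(blocks[(r0, c0)]))
    width = min(chunk, shape[1] - c0)  # edge tiles may be narrower
    return tile[(r - r0) * width + (c - c0)]


rows = [[r * 10 + c for c in range(10)] for r in range(10)]
blocks = write_chunks(rows)
value = read_element(blocks, (10, 10), 7, 9)  # touches only the tile at (4, 8)
```

With a single compressed block, the same lookup would have to decompress the stream from its start up to the requested element.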

Are there plans to support this? Can this be implemented as an extension?

One simple approach would be to introduce a new YAML tag core/chunked-ndarray that consists of a YAML sequence of chunks, each pairing an offset with an ndarray, for example:

chunky: !core/chunked-ndarray-1.0.0
  - !core/ndarray-chunk-1.0.0
    offset: [0, 0]
    data: !core/ndarray-1.0.0
      source: ... # the usual ndarray stuff here
  - !core/ndarray-chunk-1.0.0
    offset: [100, 0]
    data: !core/ndarray-1.0.0
      source: ... # the usual ndarray stuff here
  - !core/ndarray-chunk-1.0.0
    offset: [0, 100]
    data: !core/ndarray-1.0.0
      source: ... # the usual ndarray stuff here
  # possibly more chunks here
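A reader for such a layout could look roughly like the sketch below. It is only an illustration of the proposed structure, not an existing ASDF API: the function `assemble` is hypothetical, the chunks are plain nested lists rather than real ndarrays, and the overall array shape is passed in explicitly because the proposal above does not say where it would be stored.

```python
def assemble(shape, chunks):
    """Rebuild a full 2-D array (list of lists) from offset-tagged chunks."""
    rows, cols = shape
    full = [[None] * cols for _ in range(rows)]
    for chunk in chunks:
        r0, c0 = chunk["offset"]  # position of the chunk's first element
        for dr, row in enumerate(chunk["data"]):
            for dc, value in enumerate(row):
                full[r0 + dr][c0 + dc] = value
    return full


# Hypothetical chunk list mirroring the offset/data pairs of the YAML sketch.
chunks = [
    {"offset": [0, 0], "data": [[1, 2], [5, 6]]},
    {"offset": [0, 2], "data": [[3, 4], [7, 8]]},
    {"offset": [2, 0], "data": [[9, 10]]},
    {"offset": [2, 2], "data": [[11, 12]]},
]
full = assemble((3, 4), chunks)
```

A real implementation would load only the chunks that overlap the requested subarray instead of assembling everything.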

Has there been any work in this direction?

Thanks for opening this issue.

There has been some work on adding support for the zarr storage format within ASDF. This is implemented via an extension, asdf-zarr (https://github.com/asdf-format/asdf-zarr). It's a new package, so please let me know if it's something you plan to use "in production" so we can give it another review. Also feel free to give it a try and open issues if you find anything. The extension offers a few options:

  • storing the zarr data inside ASDF blocks (with a chunk per block, I think most similar to what you described)
  • referencing external zarr storage (DirectoryStore "flat files", S3 stores, or any of the many formats zarr supports)

The use of zarr also opens up a second place where compression can be controlled (which can get a bit confusing).

@braingram Nice! We are currently discussing storage formats, and both ASDF and Zarr are contenders that have various advantages and disadvantages. On the surface, using Zarr chunking with ASDF single-file storage seems like an excellent choice. I will have a look.