google/xarray-beam

Consider adding ZarrToChunks() and/or an open_zarr() helper function

shoyer opened this issue · 0 comments

These could facilitate directly opening data from Zarr using idiomatic patterns in Xarray-Beam (e.g., using Xarray's lazy indexing machinery instead of dask).

I'm imagining open_zarr() returning a tuple of values (transform, template, chunks), providing exactly the information needed to use the dataset in a Zarr-to-Zarr pipeline:

  • transform would be the beam PTransform that could be used in a pipeline (equivalent to the result of xbeam.ZarrToChunks()).
  • template itself would be an efficient lazy xarray.Dataset consisting of a single dask chunk, e.g., equivalent to xarray.zeros_like(xarray.open_zarr(..., chunks=None).chunk()).
  • chunks would be a dict of chunks on the underlying dataset.
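To make the chunks item concrete, here is a minimal sketch of how a dict of chunks could be derived from per-variable chunk shapes. The helper name infer_chunks and its input format (variable name mapped to dimension names and on-disk chunk shape) are assumptions for illustration, not part of Xarray-Beam; it uses plain Python only and raises if two variables disagree about the chunk size of a shared dimension.

```python
from typing import Dict, Mapping, Sequence, Tuple

# Hypothetical input format: variable name -> (dimension names, chunk shape).
VariableChunks = Mapping[str, Tuple[Sequence[str], Sequence[int]]]


def infer_chunks(variables: VariableChunks) -> Dict[str, int]:
    """Collapse per-variable chunk shapes into one dim -> chunk size dict."""
    chunks: Dict[str, int] = {}
    for name, (dims, chunk_shape) in variables.items():
        if len(dims) != len(chunk_shape):
            raise ValueError(f"{name!r}: dims and chunk shape differ in length")
        for dim, size in zip(dims, chunk_shape):
            # setdefault returns the previously recorded size if dim was seen.
            if chunks.setdefault(dim, size) != size:
                raise ValueError(
                    f"inconsistent chunk sizes for {dim!r}: "
                    f"{chunks[dim]} vs {size} (in {name!r})"
                )
    return chunks


example = {
    "temperature": (("time", "lat", "lon"), (10, 90, 180)),
    "pressure": (("time", "lat", "lon"), (10, 90, 180)),
    "elevation": (("lat", "lon"), (90, 180)),
}
print(infer_chunks(example))  # {'time': 10, 'lat': 90, 'lon': 180}
```

Raising on inconsistent sizes is one possible design choice; a real helper might instead need a policy for variables that are chunked differently along the same dimension.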

Usage examples:

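import apache_beam as beam
import xarray_beam as xbeam

# Option 1: a standalone transform that reads chunks from Zarr.
with beam.Pipeline() as p:
  p | xbeam.ZarrToChunks(..., desired_chunks) | ...

# Option 2: a Zarr-to-Zarr pipeline that reuses the source template and chunks.
with beam.Pipeline() as p:
  load_data, template, original_chunks = xbeam.open_zarr(...)
  p | load_data | beam.MapTuple(...) | xbeam.ChunksToZarr(..., template, original_chunks)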
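For context on what a transform like ZarrToChunks would need to enumerate, the following is a rough sketch (not the proposed implementation) of listing the starting offset of every chunk given dimension sizes and a chunks dict. The function name iter_chunk_offsets is hypothetical; the offsets it yields are similar in spirit to the per-chunk offsets that Xarray-Beam keys carry.

```python
import itertools
from typing import Dict, Iterator, Mapping


def iter_chunk_offsets(
    sizes: Mapping[str, int], chunks: Mapping[str, int]
) -> Iterator[Dict[str, int]]:
    """Yield the starting offset of every chunk along the chunked dimensions."""
    dims = list(chunks)
    # For each dimension, chunk starts are 0, chunk, 2 * chunk, ... below the size.
    starts = [range(0, sizes[dim], chunks[dim]) for dim in dims]
    for combo in itertools.product(*starts):
        yield dict(zip(dims, combo))


offsets = list(
    iter_chunk_offsets({"time": 25, "lat": 180}, {"time": 10, "lat": 90})
)
print(len(offsets))  # 6
print(offsets[0])    # {'time': 0, 'lat': 0}
print(offsets[-1])   # {'time': 20, 'lat': 90}
```

Each offset identifies one chunk to load, so a reading transform could create these offsets up front and load the corresponding slices in parallel. The last chunk along a dimension may be smaller than the requested size (here, time offset 20 covers only 5 elements).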