Consider adding ZarrToChunks() and/or an open_zarr() helper function
shoyer opened this issue · 0 comments
shoyer commented
These could facilitate directly opening data from Zarr using idiomatic patterns in Xarray-Beam (e.g., using Xarray's lazy indexing machinery instead of dask).
I'm imagining `open_zarr()` returning a tuple of values `transform, template, chunks`, providing exactly the information needed to use the dataset in a Zarr-to-Zarr pipeline:
- `transform` would be the Beam PTransform that could be used in a pipeline (equivalent to the result of `xbeam.ZarrToChunks()`).
- `template` itself would be an efficient lazy xarray.Dataset consisting of a single dask chunk, e.g., equivalent to `xarray.zeros_like(xarray.open_zarr(..., chunks=None).chunk())`.
- `chunks` would be a dict of chunks on the underlying dataset.
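To make the `template` idea concrete, here is a minimal sketch of the `zeros_like(...chunk())` pattern. It assumes dask is installed and uses a small in-memory dataset as a stand-in for `xarray.open_zarr(..., chunks=None)`; the variable and dimension names are made up for illustration and are not part of the proposal.

```python
import numpy as np
import xarray

# Hypothetical stand-in for xarray.open_zarr(..., chunks=None):
# a small in-memory dataset with one variable.
source = xarray.Dataset(
    {"temperature": (("time", "x"), np.arange(12.0).reshape(3, 4))}
)

# .chunk() with no arguments wraps each variable in a single dask chunk.
# zeros_like() keeps dims, dtypes, and chunking but replaces the values
# with lazily computed zeros, so the template carries structure only.
template = xarray.zeros_like(source.chunk())

print(template.chunks)
print(type(template["temperature"].data))
```

The result has one chunk per dimension and a dask-backed array, so nothing is computed until the values are actually requested.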
Usage examples:
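    import apache_beam as beam
    import xarray_beam as xbeam

    # Read chunks directly from Zarr with the proposed transform.
    with beam.Pipeline() as p:
        p | xbeam.ZarrToChunks(..., desired_chunks) | ...

    # Zarr-to-Zarr pipeline using the proposed open_zarr() helper.
    with beam.Pipeline() as p:
        load_data, template, original_chunks = xbeam.open_zarr(...)
        p | load_data | beam.MapTuple(...) | xbeam.ChunksToZarr(..., template, original_chunks)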