pangeo-forge/pangeo-forge-recipes

`WriteCombinedReference` should emit a `zarr.storage.FSStore` (like `StoreToZarr`)

cisaacstern opened this issue · 4 comments

StoreToZarr emits a singleton PCollection containing a zarr.storage.FSStore. WriteCombinedReference should as well.

This is very useful for designing pipelines that do something with the data once it's written, such as:

  • Validate it with some tests
  • Catalog it somewhere
  • etc.

In #590 I am relying on this feature of StoreToZarr to do integration testing, so this issue blocks me from integration testing kerchunk stores in the same manner.

This should be as simple as returning a zarr.storage.FSStore from this function

def write_combined_reference(
    reference: MultiZarrToZarr,
    full_target: FSSpecTarget,
    output_json_fname: str,
):

which is what WriteCombinedReference calls into here

return reference | beam.Map(
    write_combined_reference,

I think this was addressed in the Parquet Kerchunk option PR: https://github.com/pangeo-forge/pangeo-forge-recipes/pull/620/files. On main, write_combined_reference now returns an fsspec target.

@norlandrhagen, looks like we're quite close but not all the way there: full_target is a pangeo_forge_recipes.storage.FSSpecTarget. The aim of this issue is to return a zarr.storage.FSStore. In practical terms, what we're aiming for is to be able to pass this store value directly into xr.open_dataset, i.e.

def test_ds(store: zarr.storage.FSStore) -> zarr.storage.FSStore:
    # This fails integration test if not imported here
    # TODO: see if --setup-file option for runner fixes this
    import xarray as xr

    ds = xr.open_dataset(store, engine="zarr", chunks={})
    for var in ["anom", "err", "ice", "sst"]:
        assert var in ds.data_vars
    return store

Naively, I initially hoped that simply returning full_target.get_mapper() would give us the return type we want, but I'm pretty sure this won't work: the object returned by full_target.get_mapper() is a mapper to the directory in which the kerchunk references are stored, not a mapper to the virtualized zarr filesystem that the references represent. To get the latter, I think we need to add a bit of logic (probably just in the body of write_combined_reference) that does the sort of thing you illustrate in your Pythia example:

https://projectpythia.org/kerchunk-cookbook/notebooks/case_studies/GRIB2_HRRR.html#load-kerchunked-dataset

What we want to return from write_combined_reference is the equivalent of the variable m in the Pythia code snippet linked above. Does that make sense?
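
For concreteness, here is a minimal sketch of the kind of logic I mean, using fsspec's "reference" filesystem. The helper name open_reference_mapper and the tiny inline refs dict are hypothetical and only there so the snippet runs on its own; a real combined reference would point at byte ranges in remote files and would need the matching remote_protocol and remote_options. Note that this returns an fsspec mapper, not yet a zarr.storage.FSStore, so whether and how to wrap it is still open.

```python
import json

import fsspec

# Hypothetical, tiny kerchunk-style reference set holding inline data only.
# A real combined reference would map chunk keys to byte ranges in remote files.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        ".zattrs": json.dumps({"title": "demo"}),
    },
}


def open_reference_mapper(fo, remote_protocol="file", remote_options=None):
    """Return a mapper onto the virtual Zarr store described by `fo`.

    `fo` may be a reference dict or a path/URL to a reference JSON file.
    Illustrative helper only; it is not part of pangeo-forge-recipes.
    """
    fs = fsspec.filesystem(
        "reference",
        fo=fo,
        remote_protocol=remote_protocol,
        remote_options=remote_options or {},
        skip_instance_cache=True,
    )
    # The empty root maps keys like ".zgroup" straight onto the references.
    return fs.get_mapper("")


m = open_reference_mapper(refs)
zgroup = json.loads(m[".zgroup"])
zattrs = json.loads(m[".zattrs"])
```

With a real reference file, I'd expect m to be usable the way the Pythia notebook uses it, i.e. passed to xr.open_dataset with engine="zarr" and consolidated metadata turned off, but I haven't checked that against the output of WriteCombinedReference.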

Ah totally, that makes sense!