`WriteCombinedReference` should emit a `zarr.storage.FSStore` (like `StoreToZarr`)
cisaacstern opened this issue · 4 comments
`StoreToZarr` emits a singleton PCollection containing a `zarr.storage.FSStore`. `WriteCombinedReference` should as well.
This is very useful for designing pipelines that do something with the data once it's written, such as:
- Validate it with some tests
- Catalog it somewhere
- etc.
In #590 I am relying on this feature of `StoreToZarr` to do integration testing, so this issue blocks me from integration testing kerchunk stores in the same manner.
This should be as simple as returning a `zarr.storage.FSStore` from this function:
pangeo-forge-recipes/pangeo_forge_recipes/writers.py
Lines 95 to 99 in f094b02
which is what `WriteCombinedReference` calls into here:
pangeo-forge-recipes/pangeo_forge_recipes/transforms.py
Lines 464 to 465 in f094b02
I think this was addressed in the Parquet Kerchunk option PR: https://github.com/pangeo-forge/pangeo-forge-recipes/pull/620/files. On main, `write_combined_reference` now returns an fsspec target.
@norlandrhagen, looks like we're quite close but not all the way there: `full_target` is a `pangeo_forge_recipes.storage.FSSpecTarget`. The aim of this issue is to return a `zarr.storage.FSStore`. In practical terms, what we're aiming for is to be able to pass this store value directly into `xr.open_dataset`, i.e.
pangeo-forge-recipes/examples/feedstock/noaa_oisst.py
Lines 24 to 32 in 7e34030
Naively, I initially hoped that simply returning `full_target.get_mapper()` would give us the return type we want, but I'm pretty sure this won't work: the object returned by `full_target.get_mapper()` is a mapper to the directory in which the kerchunk references are stored, not a mapper to the virtualized zarr filesystem that the references represent. To get the latter, I think we need to add a bit of logic (probably just into the body of `write_combined_reference`), which does the sort of thing you illustrate in your Pythia example:
What we want to return from `write_combined_reference` is the equivalent of the variable `m` in the Pythia code snippet linked above. Does that make sense?
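To make the distinction between the two kinds of mapper concrete, here's a toy, stdlib-only sketch. It is not the pangeo-forge-recipes code or the Pythia snippet: `ToyReferenceMapper`, the file names, and the byte layout are all hypothetical, and the class only imitates the general idea of resolving kerchunk-style references (an inline string, or a `[path, offset, length]` triple). In a real fix, fsspec's `reference` filesystem would presumably play this role.

```python
import json
import os
import tempfile
from collections.abc import Mapping


class ToyReferenceMapper(Mapping):
    """Hypothetical stand-in for a reference mapper: resolves refs to bytes."""

    def __init__(self, refs):
        self.refs = refs

    def __getitem__(self, key):
        ref = self.refs[key]
        if isinstance(ref, str):
            # Inline reference: the value itself is the content (e.g. zarr metadata).
            return ref.encode()
        # Chunk reference: [path, offset, length] pointing into the source file.
        path, offset, length = ref
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def __iter__(self):
        return iter(self.refs)

    def __len__(self):
        return len(self.refs)


workdir = tempfile.mkdtemp()

# Hypothetical source file; bytes 4..8 hold one "chunk".
source_path = os.path.join(workdir, "source.bin")
with open(source_path, "wb") as f:
    f.write(b"HEADchunkTAIL")

# Hypothetical target directory holding the written reference file.
target_dir = os.path.join(workdir, "target")
os.makedirs(target_dir)
refs = {".zgroup": '{"zarr_format": 2}', "var/0": [source_path, 4, 5]}
with open(os.path.join(target_dir, "reference.json"), "w") as f:
    json.dump({"version": 1, "refs": refs}, f)

# A mapper over the target directory only sees the reference file itself...
directory_keys = sorted(os.listdir(target_dir))

# ...whereas a reference mapper exposes the zarr keys the references describe.
with open(os.path.join(target_dir, "reference.json")) as f:
    m = ToyReferenceMapper(json.load(f)["refs"])
virtual_keys = sorted(m)
chunk = m["var/0"]
```

Here `directory_keys` contains only the reference file, while `virtual_keys` contains the zarr-style keys, which is the view `xr.open_dataset` needs.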
Ah totally, that makes sense!