How to handle data with mixtures of GRIB1 and GRIB2?
alxmrs opened this issue · 4 comments
I'm running the `XarrayZarrRecipe` on an internal ERA5 dataset. I just found out it uses a mixture of the GRIB1 and GRIB2 standards within the same files. The simplest way I can convert the corpus to Zarr would involve filtering out some of the data (e.g. ecmwf/cfgrib#2): the way cfgrib works with xarray, to get all the variables we have to call `open_dataset` on the same file with different `filter_by_keys` arguments.
Is there a clean way to work with mixed-variable GRIB files today with pangeo-forge? If not, should we update the recipe to handle this use case?
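For context, the cfgrib workaround amounts to one `open_dataset` call per filter. A minimal sketch of how the per-edition kwargs could be built (the helper name and filtering on the ecCodes `edition` key are my own illustration, not part of pangeo-forge):

```python
def grib_open_kwargs(editions=(1, 2)):
    """Build one set of xr.open_dataset kwargs per GRIB edition.

    Each dict is meant for a separate xr.open_dataset(path, **kwargs)
    call; mixing editions in a single call is what fails on these files.
    Filtering on the ecCodes "edition" key is an assumption here.
    """
    return [
        {
            "engine": "cfgrib",
            "backend_kwargs": {"filter_by_keys": {"edition": edition}},
        }
        for edition in editions
    ]
```

The resulting per-edition datasets would then be combined with `xr.merge`, which only works when the pieces share compatible coordinates.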
xref:
@alxmrs, as you're probably aware, `XarrayZarrRecipe.xarray_open_kwargs` allows passing arguments to `open_dataset` (see `pangeo_forge_recipes/recipes/xarray_zarr.py`, lines 293 to 297 at `e6fdf87`), but currently these kwargs are applied uniformly across all inputs. (So no way to vary `filter_by_keys` here, of course.)
I don't have first-hand experience with `filter_by_keys` for GRIB. But it seems like you might be able to achieve the same result by loading the whole dataset via `open_dataset` (so no `filter_by_keys` kwarg) and then conditionally dropping variables with `XarrayZarrRecipe.process_input`, a callable with the signature shown in `pangeo_forge_recipes/recipes/xarray_zarr.py`, lines 657 to 658 at `e6fdf87`, which is applied to every input (lines 305 to 306 of the same file).
Do you think there's a way to get your desired filtering via something like the following?

```python
def filter_grib(ds: xr.Dataset, filename: str) -> xr.Dataset:
    vars_to_drop = dict(
        grib_1=[...],  # iterable of vars to drop if input file is GRIB1 format
        grib_2=[...],  # iterable of vars to drop if input file is GRIB2 format
    )
    if some_grib_1_identifier in ds.attrs:
        ds = ds.drop_vars(vars_to_drop["grib_1"])
    elif some_grib_2_identifier in ds.attrs:
        ds = ds.drop_vars(vars_to_drop["grib_2"])
    else:
        raise ValueError("GRIB version not identifiable from `ds.attrs`")
    return ds

recipe = XarrayZarrRecipe(..., process_input=filter_grib, ...)
```
Depending on how many inputs you have and/or the information encoded in their filenames, rather than inferring the GRIB version from `ds.attrs`, you may be able either to pass `filter_grib` an explicit mapping between GRIB versions and filenames, or just infer the GRIB version at runtime based on the filename.
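A filename-based version of that inference might look like this (the `grib1`/`grib2` naming convention and the variable lists are hypothetical placeholders):

```python
import re

# Hypothetical: which variables to drop for each GRIB edition.
VARS_TO_DROP = {"grib_1": ["var_a"], "grib_2": ["var_b"]}


def grib_version_from_filename(filename: str) -> str:
    """Infer the GRIB edition from a (hypothetical) 'grib1'/'grib2'
    tag embedded in the filename."""
    match = re.search(r"grib(1|2)", filename)
    if match is None:
        raise ValueError(f"GRIB version not identifiable from {filename!r}")
    return f"grib_{match.group(1)}"


def vars_to_drop_for(filename: str) -> list:
    """Look up the drop list for this input's GRIB edition."""
    return VARS_TO_DROP[grib_version_from_filename(filename)]
```

Inside `filter_grib`, the `filename` argument would then be all you need to pick the right drop list.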
Quick note:

> you might be able to achieve the same result by loading the whole dataset

Some datasets cannot be loaded at all, because the different parts conflict in their coordinate definitions. Maybe that doesn't apply in this case, but I've certainly seen it.
The PR I just started in #245 should allow you to handle this use case by providing a custom "Opener", which would dispatch the correct options depending on the filename or any other information passed from the `FilePattern`.
> Some datasets cannot be loaded at all, because the different parts conflict in their coordinate definitions.

That's exactly the case I'm running into, and it is common with GRIB. `process_input` can't address this, since it assumes we've already opened the data into xarray.

#245 would definitely solve this issue! With that, we could prevent these kinds of errors by using `cfgrib` directly instead of going through xarray, for example.