pangeo-forge/pangeo-forge-recipes

How to handle data with mixtures of GRIB 1 and GRIB 2?

alxmrs opened this issue · 4 comments

I'm running the XarrayZarrRecipe on an internal ERA5 dataset. I just found out it uses a mixture of the GRIB 1 and GRIB 2 standards within the same files. The simplest way I could convert the corpus to Zarr would involve filtering out some of the data (e.g. ecmwf/cfgrib#2): because of the way cfgrib works with xarray, to get all the variables we have to call open_dataset on the same file multiple times with different filter_by_keys arguments.
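
For context, the usual cfgrib workaround looks something like this (the edition filter key and the file path are illustrative assumptions, not from the real dataset):

import xarray as xr

# Open the same file once per GRIB edition; cfgrib can only load a
# self-consistent subset of messages in a single open_dataset call.
ds_grib1 = xr.open_dataset(
    "era5.grib",  # placeholder path
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"edition": 1}},
)
ds_grib2 = xr.open_dataset(
    "era5.grib",
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"edition": 2}},
)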

Is there a clean way to work with mixed variable grib files today with pangeo-forge? If not, do we update the recipe to handle this use case?

xref:

CC: @rabernat @cisaacstern

@alxmrs, as you're probably aware, XarrayZarrRecipe.xarray_open_kwargs allows passing arguments to open_dataset:

kw = config.xarray_open_kwargs.copy()
if "engine" not in kw:
kw["engine"] = "h5netcdf"
logger.debug(f"about to enter xr.open_dataset context on {f}")
with xr.open_dataset(f, **kw) as ds:

but currently these kwargs are applied uniformly across all inputs. (So no way to vary filter_by_keys here, of course.)
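
For illustration (pattern and the edition filter key here are assumptions), a single set of kwargs applies to every input:

from pangeo_forge_recipes.recipes import XarrayZarrRecipe

recipe = XarrayZarrRecipe(
    pattern,  # hypothetical FilePattern
    xarray_open_kwargs={
        "engine": "cfgrib",
        # one filter_by_keys value is applied to every input
        "backend_kwargs": {"filter_by_keys": {"edition": 1}},
    },
)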

I don't have firsthand experience with filter_by_keys for GRIB. But it seems like you might be able to achieve the same result by loading the whole dataset via open_dataset (i.e. with no filter_by_keys kwarg) and then conditionally dropping variables with XarrayZarrRecipe.process_input, which is a callable with the following signature

:param process_input: Function to call on each opened input, with signature
`(ds: xr.Dataset, filename: str) -> ds: xr.Dataset`.

that is applied to every input

if config.process_input is not None:
ds = config.process_input(ds, str(fname))

Do you think there's a way to get your desired filtering via something like

def filter_grib(ds: xr.Dataset, filename: str) -> xr.Dataset:
    vars_to_drop = dict(
        grib_1=[...],  # iterable of vars to drop if input file is GRIB1 format
        grib_2=[...],  # iterable of vars to drop if input file is GRIB2 format
    )
    if some_grib_1_identifier in ds.attrs:  # placeholder GRIB1 marker
        ds = ds.drop_vars(vars_to_drop["grib_1"])
    elif some_grib_2_identifier in ds.attrs:  # placeholder GRIB2 marker
        ds = ds.drop_vars(vars_to_drop["grib_2"])
    else:
        raise ValueError("GRIB version not identifiable from `ds.attrs`")
    return ds

recipe = XarrayZarrRecipe(..., process_input=filter_grib, ...)

?

Depending on how many inputs you have and/or the information encoded in their filenames, rather than inferring the GRIB version from ds.attrs, you may be able to pass filter_grib an explicit mapping between GRIB versions and filenames, or simply infer the GRIB version at runtime from the filename.
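
A minimal sketch of the filename-based variant (the mapping contents and the functools.partial wiring are assumptions):

import functools
import xarray as xr

# Hypothetical mapping from input filename to the variables to drop there.
DROP_BY_FILENAME = {
    "era5_grib1.grib": ["var_a"],
    "era5_grib2.grib": ["var_b"],
}

def filter_by_filename(ds: xr.Dataset, filename: str, drop_map: dict) -> xr.Dataset:
    # Filenames not in the mapping pass through unchanged.
    return ds.drop_vars(drop_map.get(filename, []))

recipe = XarrayZarrRecipe(
    ...,
    process_input=functools.partial(filter_by_filename, drop_map=DROP_BY_FILENAME),
)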

Quick note:

you might be able to achieve the same result by loading the whole dataset

Some datasets cannot be loaded at all, because their different parts conflict in their coordinate definitions. Maybe that doesn't apply in this case, but I've certainly seen it.

The PR I just started in #245 should allow you to handle this use case by providing a custom "Opener", which would dispatch the correct options depending on the filename or any other information passed from the FilePattern.
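
Conceptually (the actual interface in #245 may well differ), such an opener could dispatch per file:

import xarray as xr

def open_input(filename: str) -> xr.Dataset:
    # Hypothetical dispatch: choose open options based on the filename.
    edition = 1 if "grib1" in filename else 2
    return xr.open_dataset(
        filename,
        engine="cfgrib",
        backend_kwargs={"filter_by_keys": {"edition": edition}},
    )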

Some datasets cannot be loaded at all, because their different parts conflict in their coordinate definitions.

That's exactly the case that I'm running into – and it's common with GRIB. process_input can't address this, since it assumes we've already opened the data into xarray.

#245 would definitely solve this issue! With that, we could prevent these kinds of errors by using cfgrib directly instead of xarray, for example.
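
For example, cfgrib.open_datasets already splits a heterogeneous file into self-consistent datasets (a minimal sketch; the path is a placeholder):

import cfgrib

# cfgrib.open_datasets returns a list of xarray Datasets, one per
# self-consistent group of GRIB messages, so conflicting coordinate
# definitions never end up in the same Dataset.
datasets = cfgrib.open_datasets("era5.grib")  # placeholder path
for ds in datasets:
    print(ds.data_vars)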