ecmwf/cfgrib

Support DataTree for organizing Datasets by type of level

jthielen opened this issue · 4 comments

As discussed in xarray-contrib/datatree#195, it would be wonderful (and relatively straightforward) to add support for DataTree in cfgrib. This would allow the different datasets previously returned as a list from cfgrib.open_datasets() to be organized in a single data collection.

As for implementation, I would propose refactoring the existing open_datasets() into something like:

def open_datatree(path, backend_kwargs={}, **kwargs):
    # type: (str, T.Dict[str, T.Any], T.Any) -> datatree.DataTree
    """
    Open a GRIB file, grouping incompatible hypercubes into different datasets via simple heuristics.
    """
    squeeze = backend_kwargs.get("squeeze", True)
    backend_kwargs = backend_kwargs.copy()
    backend_kwargs["squeeze"] = False
    datasets = open_variable_datasets(path, backend_kwargs=backend_kwargs, **kwargs)

    type_of_level_datasets = {}  # type: T.Dict[str, T.List[xr.Dataset]]
    for ds in datasets:
        for _, da in ds.data_vars.items():
            type_of_level = da.attrs.get("GRIB_typeOfLevel", "undef")
            type_of_level_datasets.setdefault(type_of_level, []).append(ds)

    return datatree.DataTree.from_dict(type_of_level_datasets)
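
To make the grouping step concrete, here is a minimal, dependency-free sketch of the heuristic above. It is only an illustration: plain dicts stand in for the single-variable datasets, and the names `variables`, `group_by_type_of_level` and `tree_paths` are hypothetical, not part of cfgrib or datatree. It also assumes that `DataTree.from_dict` expects one dataset per path rather than a list, so the second helper flattens each group into one path per variable; merging each group first would be another option.

```python
from collections import defaultdict

# Hypothetical stand-ins for single-variable datasets: each record carries a
# variable name and the GRIB attributes that cfgrib would attach to it.
variables = [
    {"name": "t", "attrs": {"GRIB_typeOfLevel": "isobaricInhPa"}},
    {"name": "u", "attrs": {"GRIB_typeOfLevel": "isobaricInhPa"}},
    {"name": "t2m", "attrs": {"GRIB_typeOfLevel": "heightAboveGround"}},
    {"name": "tp", "attrs": {}},  # no type of level: falls back to "undef"
]


def group_by_type_of_level(records):
    """Group variable names by GRIB_typeOfLevel, defaulting to "undef"."""
    groups = defaultdict(list)
    for record in records:
        key = record["attrs"].get("GRIB_typeOfLevel", "undef")
        groups[key].append(record["name"])
    return dict(groups)


def tree_paths(groups):
    """Flatten the groups into one "<typeOfLevel>/<variable>" path per variable."""
    return {
        f"{level}/{name}": name
        for level, names in sorted(groups.items())
        for name in names
    }


groups = group_by_type_of_level(variables)
paths = tree_paths(groups)
```

With real data, the values of `paths` would be the datasets themselves rather than variable names, and that mapping is what would be handed to the tree constructor.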

Then, open_datasets could be reimplemented as something like:

def open_datasets(path, backend_kwargs={}, **kwargs):
    # Read the caller's squeeze preference here; it is not defined in this scope otherwise
    squeeze = backend_kwargs.get("squeeze", True)
    type_of_level_datasets = open_datatree(path, backend_kwargs=backend_kwargs, **kwargs)
    merged = []  # type: T.List[xr.Dataset]
    for type_of_level in sorted(type_of_level_datasets):
        for ds in merge_datasets(type_of_level_datasets[type_of_level], join="exact"):
            merged.append(ds.squeeze() if squeeze else ds)
    return merged

(These snippets were edited quickly in between conference sessions; there is no guarantee that I didn't miss something, and they may not work properly as-is.)

All that said, discussion would likely be needed to decide whether this should be supported before or after the integration of DataTree into xarray proper (xref pydata/xarray#7418).

cc @TomNicholas, @blaylockbk

#187 and #321 are additional cases where DataTree could help cfgrib: different stepRange values for precipitation (and other?) variables.