ecmwf/earthkit-data

Inconsistent use of xarray's open methods

What happened?

Some backends use xr.open_dataset whereas others use xr.open_mfdataset.

Because of that, our code does not work seamlessly with all datasets.
As xr.open_mfdataset is more general and supports more features, would it be possible to use it everywhere?

There's also another important downside: xr.open_dataset and xr.open_mfdataset do not behave identically even when given a single file. For example, xr.open_mfdataset returns dask-backed arrays by default, whereas xr.open_dataset only does so if you explicitly pass the chunks argument (e.g. chunks={}).

What are the steps to reproduce the bug?

import earthkit.data

collection_id = "reanalysis-era5-single-levels"
request = {
    "variable": "2t",
    "product_type": "reanalysis",
    "date": "2012-12-01",
    "time": "12:00",
}
kwargs = {"preprocess": lambda ds: ds**2}

nc = earthkit.data.from_source("cds", collection_id, **request, format="netcdf")
nc.to_xarray(xarray_open_mfdataset_kwargs=kwargs)  # OK

grib = earthkit.data.from_source("cds", collection_id, **request, format="grib")
grib.to_xarray(xarray_open_mfdataset_kwargs=kwargs)
# TypeError: CfGribBackend.open_dataset() got an unexpected keyword argument 'preprocess'
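
Until the backends are aligned, a workaround along the following lines might help. This is only a minimal sketch, not how earthkit-data does it: `MF_ONLY_KEYS`, `split_mfdataset_kwargs` and `open_single` are hypothetical names, and the key list is my reading of the xr.open_mfdataset signature, so it should be checked against the installed xarray version. The idea is to strip the arguments that only xr.open_mfdataset understands, open the single source with an open_dataset-like callable, request dask-backed data via chunks={}, and apply preprocess by hand.

```python
# Keyword arguments that xr.open_mfdataset accepts but xr.open_dataset does not.
# Assumption: taken from the open_mfdataset signature as I understand it;
# verify against the xarray version in use.
MF_ONLY_KEYS = frozenset({
    "preprocess", "combine", "concat_dim", "compat", "data_vars",
    "coords", "parallel", "join", "attrs_file", "combine_attrs",
})


def split_mfdataset_kwargs(kwargs):
    """Split kwargs into (open_dataset kwargs, open_mfdataset-only kwargs)."""
    single = {k: v for k, v in kwargs.items() if k not in MF_ONLY_KEYS}
    multi = {k: v for k, v in kwargs.items() if k in MF_ONLY_KEYS}
    return single, multi


def open_single(opener, source, **kwargs):
    """Open one source with an open_dataset-like callable (hypothetical helper).

    Emulates the two open_mfdataset behaviours discussed in this issue:
    dask-backed output by default and the `preprocess` hook.
    """
    single, multi = split_mfdataset_kwargs(kwargs)
    # Ask for dask-backed arrays unless the caller chose `chunks` explicitly.
    single.setdefault("chunks", {})
    ds = opener(source, **single)
    # With a single source, preprocess is just a function applied once.
    preprocess = multi.get("preprocess")
    return ds if preprocess is None else preprocess(ds)
```

With xarray itself this would be called as open_single(xr.open_dataset, path, engine="cfgrib", **kwargs). For the GRIB object above, the same idea by hand would be something like kwargs["preprocess"](grib.to_xarray(xarray_open_dataset_kwargs={"chunks": {}})), which I have not tested across backends.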

Version

0.7.0

Platform (OS and architecture)

Linux eqc-quality-tools.eqc.compute.cci1.ecmwf.int 5.14.0-362.8.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Nov 8 17:36:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Relevant log output

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[15], line 16
     13 nc.to_xarray(xarray_open_mfdataset_kwargs=kwargs)  # OK
     15 grib = earthkit.data.from_source("cds", collection_id, **request, format="grib")
---> 16 grib.to_xarray(xarray_open_mfdataset_kwargs=kwargs)
     17 # TypeError: CfGribBackend.open_dataset() got an unexpected keyword argument 'preprocess'

File /data/common/miniforge3/envs/wp3/lib/python3.11/site-packages/earthkit/data/readers/grib/xarray.py:138, in XarrayMixIn.to_xarray(self, **kwargs)
    125 default.update(self.xarray_open_dataset_kwargs())
    127 xarray_open_dataset_kwargs.update(
    128     Kwargs(
    129         user=user_xarray_open_dataset_kwargs,
   (...)
    135     )
    136 )
--> 138 result = xr.open_dataset(
    139     IndexWrapperForCfGrib(self, ignore_keys=ignore_keys),
    140     **xarray_open_dataset_kwargs,
    141 )
    143 return result

File /data/common/miniforge3/envs/wp3/lib/python3.11/site-packages/xarray/backends/api.py:573, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
    561 decoders = _resolve_decoders_kwargs(
    562     decode_cf,
    563     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    569     decode_coords=decode_coords,
    570 )
    572 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 573 backend_ds = backend.open_dataset(
    574     filename_or_obj,
    575     drop_variables=drop_variables,
    576     **decoders,
    577     **kwargs,
    578 )
    579 ds = _dataset_from_backend_dataset(
    580     backend_ds,
    581     filename_or_obj,
   (...)
    591     **kwargs,
    592 )
    593 return ds

TypeError: CfGribBackend.open_dataset() got an unexpected keyword argument 'preprocess'

Accompanying data

No response

Organisation

B-Open / CADS-EQC

@malmans2, thank you for reporting this issue. I agree that using xr.open_mfdataset consistently would be a good idea. This will be fixed in the next release.

Also related to this issue is the following comment from @malmans2 in #375:

Just wanted to provide more details about how we use this, since you mentioned that we should not import the reader class and that a new method will be added:

if isinstance(earthkit_ds, GRIBReader):
    xr_ds = earthkit_ds.to_xarray(xarray_open_dataset_kwargs={"squeeze": False, "chunks": {}})
elif isinstance(earthkit_ds, CSVReader):
    xr_ds = earthkit_ds.to_xarray(pandas_read_csv_kwargs=...)
elif ...:
    ...
else:
    xr_ds = earthkit_ds.to_xarray(xarray_open_mfdataset_kwargs=...)