cerfacs-globc/icclim

BUG: dask auto rechunking fails

pagecp opened this issue · 2 comments

  • icclim version: 5.3.0
  • Python version: 3.9

Description

When trying to process the CMIP6 file: data/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p2f1/day/tasmax/gn/v20190429/tasmax_day_CanESM5_historical_r1i1p2f1_gn_18500101-20141231.nc
icclim fails with a dask error:
NotImplementedError: Can not use auto rechunking with object dtype. We are unable to estimate the size in bytes of object data
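
The failure is independent of the CMIP6 file itself: dask's "auto" chunking derives chunk sizes from the dtype's byte size, which object arrays do not have. A minimal sketch, assuming any object-dtype array (such as cftime bounds):

    import numpy as np
    import dask.array as da

    # "auto" chunking targets a chunk size in bytes, derived from dtype.itemsize;
    # object arrays have no fixed element size, so dask raises NotImplementedError.
    bounds = np.empty((60225, 2), dtype=object)
    da.from_array(bounds, chunks="auto")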

Minimal reproducible example

    import icclim

    infile = '/home/jovyan/data/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p2f1/day/tasmax/gn/v20190429/tasmax_day_CanESM5_historical_r1i1p2f1_gn_18500101-20141231.nc'
    outfile = 'data/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p2f1/day/tasmax/gn/v20190429/SU_icclim_tasmax_day_CanESM5_historical_r1i1p2f1_gn_18500101-20141231.nc'
    icclim.index(index_name='SU', in_files=infile, var_name='tasmax',
                 slice_mode='JJA', out_file=outfile, logs_verbosity='HIGH')

Output received

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Input In [44], in <cell line: 8>()
     15 outfile=outdir+"/SU_icclim_"+filename
     16 print("Processing "+infile)
---> 18 icclim.index(index_name='SU', in_files=infile, var_name='tasmax', slice_mode='JJA', out_file=outfile, logs_verbosity='HIGH')

File /opt/conda/lib/python3.9/site-packages/icclim/main.py:265, in index(in_files, index_name, var_name, slice_mode, time_range, out_file, threshold, callback, callback_percentage_start_value, callback_percentage_total, base_period_time_range, window_width, only_leap_years, ignore_Feb29th, interpolation, out_unit, netcdf_version, user_index, save_percentile, logs_verbosity, indice_name, user_indice, transfer_limit_Mbytes)
    263 input_dataset, reset_coords_dict = update_to_standard_coords(input_dataset)
    264 sampling_frequency = Frequency.lookup(slice_mode)
--> 265 input_dataset = input_dataset.chunk("auto")
    266 cf_vars = build_cf_variables(
    267     var_names=guess_var_names(input_dataset, in_files, index, var_name),
    268     ds=input_dataset,
   (...)
    273     freq=sampling_frequency,
    274 )
    275 config = IndexConfig(
    276     save_percentile=save_percentile,
    277     frequency=sampling_frequency,
   (...)
    285     threshold=threshold,
    286 )

File /opt/conda/lib/python3.9/site-packages/xarray/core/dataset.py:2188, in Dataset.chunk(self, chunks, name_prefix, token, lock)
   2183 if bad_dims:
   2184     raise ValueError(
   2185         f"some chunks keys are not dimensions on this object: {bad_dims}"
   2186     )
-> 2188 variables = {
   2189     k: _maybe_chunk(k, v, chunks, token, lock, name_prefix)
   2190     for k, v in self.variables.items()
   2191 }
   2192 return self._replace(variables)

File /opt/conda/lib/python3.9/site-packages/xarray/core/dataset.py:2189, in <dictcomp>(.0)
   2183 if bad_dims:
   2184     raise ValueError(
   2185         f"some chunks keys are not dimensions on this object: {bad_dims}"
   2186     )
   2188 variables = {
-> 2189     k: _maybe_chunk(k, v, chunks, token, lock, name_prefix)
   2190     for k, v in self.variables.items()
   2191 }
   2192 return self._replace(variables)

File /opt/conda/lib/python3.9/site-packages/xarray/core/dataset.py:433, in _maybe_chunk(name, var, chunks, token, lock, name_prefix, overwrite_encoded_chunks)
    431 token2 = tokenize(name, token if token else var._data, chunks)
    432 name2 = f"{name_prefix}{name}-{token2}"
--> 433 var = var.chunk(chunks, name=name2, lock=lock)
    435 if overwrite_encoded_chunks and var.chunks is not None:
    436     var.encoding["chunks"] = tuple(x[0] for x in var.chunks)

File /opt/conda/lib/python3.9/site-packages/xarray/core/variable.py:1095, in Variable.chunk(self, chunks, name, lock)
   1092     if utils.is_dict_like(chunks):
   1093         chunks = tuple(chunks.get(n, s) for n, s in enumerate(self.shape))
-> 1095     data = da.from_array(data, chunks, name=name, lock=lock, **kwargs)
   1097 return self._replace(data=data)

File /opt/conda/lib/python3.9/site-packages/dask/array/core.py:3352, in from_array(x, chunks, name, lock, asarray, fancy, getitem, meta, inline_array)
   3348     asarray = not hasattr(x, "__array_function__")
   3350 previous_chunks = getattr(x, "chunks", None)
-> 3352 chunks = normalize_chunks(
   3353     chunks, x.shape, dtype=x.dtype, previous_chunks=previous_chunks
   3354 )
   3356 if name in (None, True):
   3357     token = tokenize(x, chunks, lock, asarray, fancy, getitem, inline_array)

File /opt/conda/lib/python3.9/site-packages/dask/array/core.py:2968, in normalize_chunks(chunks, shape, limit, dtype, previous_chunks)
   2965 chunks = tuple("auto" if isinstance(c, str) and c != "auto" else c for c in chunks)
   2967 if any(c == "auto" for c in chunks):
-> 2968     chunks = auto_chunks(chunks, shape, limit, dtype, previous_chunks)
   2970 if shape is not None:
   2971     chunks = tuple(c if c not in {None, -1} else s for c, s in zip(chunks, shape))

File /opt/conda/lib/python3.9/site-packages/dask/array/core.py:3064, in auto_chunks(chunks, shape, limit, dtype, previous_chunks)
   3061     raise TypeError("dtype must be known for auto-chunking")
   3063 if dtype.hasobject:
-> 3064     raise NotImplementedError(
   3065         "Can not use auto rechunking with object dtype. "
   3066         "We are unable to estimate the size in bytes of object data"
   3067     )
   3069 for x in tuple(chunks) + tuple(shape):
   3070     if (
   3071         isinstance(x, Number)
   3072         and np.isnan(x)
   3073         or isinstance(x, tuple)
   3074         and np.isnan(x).any()
   3075     ):

NotImplementedError: Can not use auto rechunking with object dtype. We are unable to estimate the size in bytes of object data

Following https://ncar.github.io/zulip-archive/stream/10-python-questions/topic/xr.2Econcat.3A.20auto.20rechunking.20error.html, if I drop the time_bnds variable it works (time_bnds is decoded to cftime objects, hence the object dtype that dask cannot size):

    import xarray as xr
    import icclim

    ds = xr.open_dataset(infile)
    ds = ds.drop_vars('time_bnds')  # Dataset.drop is deprecated in favor of drop_vars
    icclim.index(ds, index_name='SU', var_name='tasmax', slice_mode='JJA',
                 out_file=outfile, logs_verbosity='HIGH')
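
A slightly more general version of the same workaround (a sketch, not an icclim API) is to drop every object-dtype data variable instead of naming time_bnds explicitly:

    import xarray as xr
    import icclim

    ds = xr.open_dataset(infile)
    # Dask cannot "auto"-chunk arrays whose element size is unknown, so drop
    # every data variable stored with object dtype (here only time_bnds).
    obj_vars = [name for name, var in ds.data_vars.items() if var.dtype == object]
    icclim.index(ds.drop_vars(obj_vars), index_name='SU', var_name='tasmax',
                 slice_mode='JJA', out_file=outfile)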

For information, the time_bnds variable and the dataset look like this:

<xarray.DataArray 'time_bnds' (time: 60225, bnds: 2)>
[120450 values with dtype=object]
Coordinates:
  * time     (time) object 1850-01-01 12:00:00 ... 2014-12-31 12:00:00
    height   float64 ...
Dimensions without coordinates: bnds
<xarray.Dataset>
Dimensions:    (time: 60225, bnds: 2, lat: 64, lon: 128)
Coordinates:
  * time       (time) object 1850-01-01 12:00:00 ... 2014-12-31 12:00:00
  * lat        (lat) float64 -87.86 -85.1 -82.31 -79.53 ... 82.31 85.1 87.86
  * lon        (lon) float64 0.0 2.812 5.625 8.438 ... 348.8 351.6 354.4 357.2
    height     float64 ...
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) object ...
    lat_bnds   (lat, bnds) float64 ...
    lon_bnds   (lon, bnds) float64 ...
    tasmax     (time, lat, lon) float32 ...
Attributes: (12/53)
    CCCma_model_hash:            8ac7a3c953a92eb65289508ded4d1b280d2bae9e
    CCCma_parent_runid:          p2-pictrl
    CCCma_pycmor_hash:           33c30511acc319a98240633965a04ca99c26427e
    CCCma_runid:                 p2-his01
    Conventions:                 CF-1.7 CMIP-6.2
    YMDH_branch_time_in_child:   1850:01:01:00
    ...                          ...
    tracking_id:                 hdl:21.14100/96211482-61e0-4b17-bc3e-81f7167...
    variable_id:                 tasmax
    variant_label:               r1i1p2f1
    version:                     v20190429
    license:                     CMIP6 model data produced by The Government ...
    cmor_version:                3.4.0
bzah commented

Ah...
I think I introduced this bug recently: before, we chunked only the studied data variable, but for simplicity I changed that to chunking the whole dataset.
That simplifies chunking when there are multiple studied data variables (say, tmax and pr).
But I did not foresee that some variables can't be chunked; I'll fix that.
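
A minimal sketch of that direction, assuming a hypothetical helper (not the actual icclim patch): chunk each data variable individually and skip the ones dask cannot size:

    import xarray as xr

    def chunk_non_object(ds: xr.Dataset) -> xr.Dataset:
        # Hypothetical helper: apply "auto" chunking per data variable,
        # leaving object-dtype variables (e.g. cftime bounds) unchunked.
        out = ds.copy()
        for name, var in ds.data_vars.items():
            if var.dtype != object:
                out[name] = var.chunk("auto")
        return out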