BUG: dask auto rechunking fails
pagecp opened this issue · 2 comments
- icclim version: 5.3.0
- Python version: 3.9
Description
When trying to process the CMIP6 file: data/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p2f1/day/tasmax/gn/v20190429/tasmax_day_CanESM5_historical_r1i1p2f1_gn_18500101-20141231.nc
icclim fails with the following dask error:
NotImplementedError: Can not use auto rechunking with object dtype. We are unable to estimate the size in bytes of object data
Minimal reproducible example
import icclim

infile = '/home/jovyan/data/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p2f1/day/tasmax/gn/v20190429/tasmax_day_CanESM5_historical_r1i1p2f1_gn_18500101-20141231.nc'
outfile = 'data/CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p2f1/day/tasmax/gn/v20190429/SU_icclim_tasmax_day_CanESM5_historical_r1i1p2f1_gn_18500101-20141231.nc'
icclim.index(index_name='SU', in_files=infile, var_name='tasmax', slice_mode='JJA', out_file=outfile, logs_verbosity='HIGH')
Output received
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
Input In [44], in <cell line: 8>()
15 outfile=outdir+"/SU_icclim_"+filename
16 print("Processing "+infile)
---> 18 icclim.index(index_name='SU', in_files=infile, var_name='tasmax', slice_mode='JJA', out_file=outfile, logs_verbosity='HIGH')
File /opt/conda/lib/python3.9/site-packages/icclim/main.py:265, in index(in_files, index_name, var_name, slice_mode, time_range, out_file, threshold, callback, callback_percentage_start_value, callback_percentage_total, base_period_time_range, window_width, only_leap_years, ignore_Feb29th, interpolation, out_unit, netcdf_version, user_index, save_percentile, logs_verbosity, indice_name, user_indice, transfer_limit_Mbytes)
263 input_dataset, reset_coords_dict = update_to_standard_coords(input_dataset)
264 sampling_frequency = Frequency.lookup(slice_mode)
--> 265 input_dataset = input_dataset.chunk("auto")
266 cf_vars = build_cf_variables(
267 var_names=guess_var_names(input_dataset, in_files, index, var_name),
268 ds=input_dataset,
(...)
273 freq=sampling_frequency,
274 )
275 config = IndexConfig(
276 save_percentile=save_percentile,
277 frequency=sampling_frequency,
(...)
285 threshold=threshold,
286 )
File /opt/conda/lib/python3.9/site-packages/xarray/core/dataset.py:2188, in Dataset.chunk(self, chunks, name_prefix, token, lock)
2183 if bad_dims:
2184 raise ValueError(
2185 f"some chunks keys are not dimensions on this object: {bad_dims}"
2186 )
-> 2188 variables = {
2189 k: _maybe_chunk(k, v, chunks, token, lock, name_prefix)
2190 for k, v in self.variables.items()
2191 }
2192 return self._replace(variables)
File /opt/conda/lib/python3.9/site-packages/xarray/core/dataset.py:2189, in <dictcomp>(.0)
2183 if bad_dims:
2184 raise ValueError(
2185 f"some chunks keys are not dimensions on this object: {bad_dims}"
2186 )
2188 variables = {
-> 2189 k: _maybe_chunk(k, v, chunks, token, lock, name_prefix)
2190 for k, v in self.variables.items()
2191 }
2192 return self._replace(variables)
File /opt/conda/lib/python3.9/site-packages/xarray/core/dataset.py:433, in _maybe_chunk(name, var, chunks, token, lock, name_prefix, overwrite_encoded_chunks)
431 token2 = tokenize(name, token if token else var._data, chunks)
432 name2 = f"{name_prefix}{name}-{token2}"
--> 433 var = var.chunk(chunks, name=name2, lock=lock)
435 if overwrite_encoded_chunks and var.chunks is not None:
436 var.encoding["chunks"] = tuple(x[0] for x in var.chunks)
File /opt/conda/lib/python3.9/site-packages/xarray/core/variable.py:1095, in Variable.chunk(self, chunks, name, lock)
1092 if utils.is_dict_like(chunks):
1093 chunks = tuple(chunks.get(n, s) for n, s in enumerate(self.shape))
-> 1095 data = da.from_array(data, chunks, name=name, lock=lock, **kwargs)
1097 return self._replace(data=data)
File /opt/conda/lib/python3.9/site-packages/dask/array/core.py:3352, in from_array(x, chunks, name, lock, asarray, fancy, getitem, meta, inline_array)
3348 asarray = not hasattr(x, "__array_function__")
3350 previous_chunks = getattr(x, "chunks", None)
-> 3352 chunks = normalize_chunks(
3353 chunks, x.shape, dtype=x.dtype, previous_chunks=previous_chunks
3354 )
3356 if name in (None, True):
3357 token = tokenize(x, chunks, lock, asarray, fancy, getitem, inline_array)
File /opt/conda/lib/python3.9/site-packages/dask/array/core.py:2968, in normalize_chunks(chunks, shape, limit, dtype, previous_chunks)
2965 chunks = tuple("auto" if isinstance(c, str) and c != "auto" else c for c in chunks)
2967 if any(c == "auto" for c in chunks):
-> 2968 chunks = auto_chunks(chunks, shape, limit, dtype, previous_chunks)
2970 if shape is not None:
2971 chunks = tuple(c if c not in {None, -1} else s for c, s in zip(chunks, shape))
File /opt/conda/lib/python3.9/site-packages/dask/array/core.py:3064, in auto_chunks(chunks, shape, limit, dtype, previous_chunks)
3061 raise TypeError("dtype must be known for auto-chunking")
3063 if dtype.hasobject:
-> 3064 raise NotImplementedError(
3065 "Can not use auto rechunking with object dtype. "
3066 "We are unable to estimate the size in bytes of object data"
3067 )
3069 for x in tuple(chunks) + tuple(shape):
3070 if (
3071 isinstance(x, Number)
3072 and np.isnan(x)
3073 or isinstance(x, tuple)
3074 and np.isnan(x).any()
3075 ):
NotImplementedError: Can not use auto rechunking with object dtype. We are unable to estimate the size in bytes of object data
Following https://ncar.github.io/zulip-archive/stream/10-python-questions/topic/xr.2Econcat.3A.20auto.20rechunking.3A.20auto.20rechunking.20error.html, if I remove the time_bnds variable it works:
import xarray as xr
import icclim

ds = xr.open_dataset(infile)
ds = ds.drop_vars('time_bnds')  # remove the object-dtype bounds variable (drop() is deprecated)
icclim.index(ds, index_name='SU', var_name='tasmax', slice_mode='JJA', out_file=outfile, logs_verbosity='HIGH')
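A variant that should be equivalent, using xarray's standard drop_variables option to skip the variable at load time:

ds = xr.open_dataset(infile, drop_variables=['time_bnds'])
icclim.index(ds, index_name='SU', var_name='tasmax', slice_mode='JJA', out_file=outfile, logs_verbosity='HIGH')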
For reference, the time_bnds variable and the full dataset:
<xarray.DataArray 'time_bnds' (time: 60225, bnds: 2)>
[120450 values with dtype=object]
Coordinates:
* time (time) object 1850-01-01 12:00:00 ... 2014-12-31 12:00:00
height float64 ...
Dimensions without coordinates: bnds
<xarray.Dataset>
Dimensions: (time: 60225, bnds: 2, lat: 64, lon: 128)
Coordinates:
* time (time) object 1850-01-01 12:00:00 ... 2014-12-31 12:00:00
* lat (lat) float64 -87.86 -85.1 -82.31 -79.53 ... 82.31 85.1 87.86
* lon (lon) float64 0.0 2.812 5.625 8.438 ... 348.8 351.6 354.4 357.2
height float64 ...
Dimensions without coordinates: bnds
Data variables:
time_bnds (time, bnds) object ...
lat_bnds (lat, bnds) float64 ...
lon_bnds (lon, bnds) float64 ...
tasmax (time, lat, lon) float32 ...
Attributes: (12/53)
CCCma_model_hash: 8ac7a3c953a92eb65289508ded4d1b280d2bae9e
CCCma_parent_runid: p2-pictrl
CCCma_pycmor_hash: 33c30511acc319a98240633965a04ca99c26427e
CCCma_runid: p2-his01
Conventions: CF-1.7 CMIP-6.2
YMDH_branch_time_in_child: 1850:01:01:00
... ...
tracking_id: hdl:21.14100/96211482-61e0-4b17-bc3e-81f7167...
variable_id: tasmax
variant_label: r1i1p2f1
version: v20190429
license: CMIP6 model data produced by The Government ...
cmor_version: 3.4.0
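Note the object dtype on both time and time_bnds: presumably the file uses a non-standard (365-day) calendar, so xarray decodes times to cftime objects instead of datetime64. As far as I can tell, the time coordinate itself is an index variable and is never dask-chunked, but time_bnds is an ordinary data variable, so chunk("auto") hits dask's object-dtype check. A minimal sketch of the same dask code path, using a stand-in object array:

import numpy as np
import dask.array as da

# A stand-in for the cftime-decoded time_bnds variable: any
# object-dtype array triggers the same check in auto_chunks.
bnds = np.empty((4, 2), dtype=object)
da.from_array(bnds, chunks="auto")
# NotImplementedError: Can not use auto rechunking with object dtype. ...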
Ah...
I think I introduced this bug recently: we used to chunk only the studied data variable, but for simplicity I changed that to chunking the whole dataset.
That simplifies the chunking when there are multiple studied data variables (say, tasmax and pr).
But I did not foresee that some variables can't be auto-chunked; I'll fix that.
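Something along these lines should do it (a rough sketch, not the final patch): auto-chunk each data variable individually and leave object-dtype variables alone.

import xarray as xr

def chunk_non_object(ds: xr.Dataset) -> xr.Dataset:
    # Auto-chunk each data variable separately; skip object-dtype
    # variables (e.g. cftime-based time_bnds), since dask cannot
    # estimate their size in bytes.
    for name, var in ds.data_vars.items():
        if var.dtype.kind != "O":
            ds[name] = var.chunk("auto")
    return ds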