TypeError when rechunking a group
DamienIrving opened this issue · 3 comments
I'm rechunking a dataset that I ultimately want to open with xarray, so I was following the rechunk a group part of the online tutorial as follows (using rechunker version 0.5.0):
import xarray as xr
import zarr
from rechunker import rechunk
ds = xr.tutorial.open_dataset("air_temperature")
ds = ds.chunk({'time': 100})
ds.air.encoding = {}
ds.to_zarr('air_temperature.zarr')
source_group = zarr.open('air_temperature.zarr')
source_array = source_group['air']
target_chunks_dict = {
'air': {'time': 2920, 'lat': 25, 'lon': 1},
'time': None,
'lon': None,
'lat': None,
}
max_mem = '1MB'
target_store = 'air_rechunked.zarr'
temp_store = 'air_rechunked-tmp.zarr'
array_plan = rechunk(source_array, target_chunks_dict, max_mem, target_store, temp_store=temp_store)
TypeError Traceback (most recent call last)
Input In [23], in <cell line: 10>()
8 target_store = 'air_rechunked.zarr'
9 temp_store = 'air_rechunked-tmp.zarr'
---> 10 array_plan = rechunk(source_array, target_chunks_dict, max_mem, target_store, temp_store=temp_store)
File /g/data/xv83/dbi599/miniconda3/envs/agcd/lib/python3.10/site-packages/rechunker/api.py:302, in rechunk(source, target_chunks, max_mem, target_store, target_options, temp_store, temp_options, executor)
299 if isinstance(executor, str):
300 executor = _get_executor(executor)
--> 302 copy_spec, intermediate, target = _setup_rechunk(
303 source=source,
304 target_chunks=target_chunks,
305 max_mem=max_mem,
306 target_store=target_store,
307 target_options=target_options,
308 temp_store=temp_store,
309 temp_options=temp_options,
310 )
311 plan = executor.prepare_plan(copy_spec)
312 return Rechunked(executor, plan, source, intermediate, target)
File /g/data/xv83/dbi599/miniconda3/envs/agcd/lib/python3.10/site-packages/rechunker/api.py:472, in _setup_rechunk(source, target_chunks, max_mem, target_store, target_options, temp_store, temp_options)
468 return copy_specs, temp_group, target_group
470 elif isinstance(source, (zarr.core.Array, dask.array.Array)):
--> 472 copy_spec = _setup_array_rechunk(
473 source,
474 target_chunks,
475 max_mem,
476 target_store,
477 target_options=target_options,
478 temp_store_or_group=temp_store,
479 temp_options=temp_options,
480 )
481 intermediate = copy_spec.intermediate.array
482 target = copy_spec.write.array
File /g/data/xv83/dbi599/miniconda3/envs/agcd/lib/python3.10/site-packages/rechunker/api.py:531, in _setup_array_rechunk(source_array, target_chunks, max_mem, target_store_or_group, target_options, temp_store_or_group, temp_options, name)
529 # don't consolidate reads for Dask arrays
530 consolidate_reads = isinstance(source_array, zarr.core.Array)
--> 531 read_chunks, int_chunks, write_chunks = rechunking_plan(
532 shape,
533 source_chunks,
534 target_chunks,
535 itemsize,
536 max_mem,
537 consolidate_reads=consolidate_reads,
538 )
540 # create target
541 shape = tuple(int(x) for x in shape) # ensure python ints for serialization
File /g/data/xv83/dbi599/miniconda3/envs/agcd/lib/python3.10/site-packages/rechunker/algorithm.py:117, in rechunking_plan(shape, source_chunks, target_chunks, itemsize, max_mem, consolidate_reads, consolidate_writes)
114 raise ValueError(f"target_chunks {target_chunks} must have length {ndim}")
116 source_chunk_mem = itemsize * prod(source_chunks)
--> 117 target_chunk_mem = itemsize * prod(target_chunks)
119 if source_chunk_mem > max_mem:
120 raise ValueError(
121 f"Source chunk memory ({source_chunk_mem}) exceeds max_mem ({max_mem})"
122 )
File /g/data/xv83/dbi599/miniconda3/envs/agcd/lib/python3.10/site-packages/rechunker/compat.py:11, in prod(iterable)
8 try:
9 from math import prod as mathprod # type: ignore # Python 3.8
---> 11 return mathprod(iterable)
12 except ImportError:
13 return reduce(operator.mul, iterable, 1)
TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'
It looks like it isn't happy with the 'time': None, 'lon': None, 'lat': None in the target chunks dictionary?
Thanks for reporting this, Damien! I agree this looks like a bug. I am guessing the problem is that you do have the dimension specified as chunked here
'air': {'time': 2920, 'lat': 25, 'lon': 1},
but not here
'time': None,
'lon': None,
'lat': None,
and that is what is confusing rechunker.
To work around this, for now you might try specifying the chunks as tuples (rather than dicts), i.e.
'air': (2920, 25, 1) # assuming dim order is time, lat, lon
OK, my previous suggestion was a red herring. I think I found the issue: you want to rechunk a group, but you are passing an array to rechunk. Try
- array_plan = rechunk(source_array, target_chunks_dict, max_mem, target_store, temp_store=temp_store)
+ array_plan = rechunk(source_group, target_chunks_dict, max_mem, target_store, temp_store=temp_store)
That worked for me!
Here is the relevant line from the tutorial:
array_plan = rechunk(source_group, target_chunks, max_mem, target_store, temp_store=temp_store)
I suppose it is confusing to call it array_plan when actually we are rechunking a group.
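For anyone hitting the same traceback, here is a rough sketch of why passing the array fails. It is an illustration only, not rechunker's actual code: chunks_for_array is a hypothetical helper, and it assumes that a chunk dict given for a single array is read by dimension name. Under that assumption, the group-style dict has only None under 'time', 'lat' and 'lon', so the product of the chunk sizes fails the same way as in the traceback above.

```python
from math import prod

# Hypothetical helper (an assumption, not part of rechunker): read one chunk
# size per dimension name, the way a per-array chunk dict would be interpreted.
def chunks_for_array(dims, target_chunks):
    return tuple(target_chunks.get(dim) for dim in dims)

# Group-style target chunks from the issue: keyed by variable name.
group_style = {
    "air": {"time": 2920, "lat": 25, "lon": 1},
    "time": None,
    "lon": None,
    "lat": None,
}
dims = ("time", "lat", "lon")  # assumed dimension order of the 'air' array

# Reading the group-style dict as if it described one array picks up the Nones.
bad = chunks_for_array(dims, group_style)

# Reading only the 'air' entry gives real chunk sizes.
good = chunks_for_array(dims, group_style["air"])

message = None
try:
    prod(bad)  # multiplying an int by None raises TypeError
except TypeError as err:
    message = str(err)

print(bad, good, prod(good), message)
```

So the group-style dict belongs with source_group. If you really do want to rechunk just the one array, pass chunks for that array only, e.g. the 'air' entry on its own.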
Thanks, @rabernat! That fixes it. My apologies - I should have spotted that typo.