pangeo-data/rechunker

Inconsistent `target_chunks` api behavior between zarr group and xarray dataset

rabernat opened this issue · 0 comments

The docs say the following about the target_chunks argument when rechunking a group:

For a group of arrays, a dict is required. The keys correspond to array names. The values are target_chunks arguments for the array. For example, {'foo': (20, 10), 'bar': {'x': 3, 'y': 5}, 'baz': None}. All arrays you want to rechunk must be explicitly named. Arrays that are not present in the target_chunks dict will be ignored.
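
To make the documented group semantics concrete, here is a minimal plain-Python model of them. It is not rechunker's code: `plan_group_rechunk` and the extra array name `qux` are hypothetical, introduced only for illustration.

```python
def plan_group_rechunk(array_names, target_chunks):
    """Model the documented group behavior (hypothetical helper).

    Only arrays explicitly named in target_chunks are kept in the plan;
    every other array in the group is ignored.
    """
    return {
        name: target_chunks[name]
        for name in array_names
        if name in target_chunks
    }


# 'qux' is a made-up fourth array that the target_chunks dict does not mention.
group_arrays = ["foo", "bar", "baz", "qux"]
target_chunks = {"foo": (20, 10), "bar": {"x": 3, "y": 5}, "baz": None}

plan = plan_group_rechunk(group_arrays, target_chunks)
ignored = [name for name in group_arrays if name not in plan]
```

Under this model, `qux` ends up in `ignored` and would not appear in the output at all.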

Xarray datasets are very similar to Zarr groups, yet rechunking behaves a bit differently for them. This difference is documented in the tests, but not in the docs. Here is the target_chunks parameter for test_rechunk_dataset:

"target_chunks",
[{"a": (20, 10), "b": (20,)}, {"a": {"x": 20, "y": 10}, "b": {"x": 20}}],

Note that the variable c is not present. However, it is present in the output dataset:
assert dst.a.data.chunksize == target_chunks_expected
assert dst.b.data.chunksize == target_chunks_expected[:1]
assert dst.c.data.chunksize == source_chunks[1:]

The original chunks have been preserved, a reasonable default.

We should strive to reconcile, or at least document, this difference. My personal preference would be to change the API so that a flat zarr group behaves the same as the xarray dataset: variables that are not mentioned in target_chunks simply get passed through with identical chunks.
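
A rough sketch of the pass-through semantics, in plain Python rather than rechunker's actual code. The helper `resolve_target_chunks` and the source chunk sizes are hypothetical; only the variable names a, b, c and the target values come from the test above.

```python
def resolve_target_chunks(source_chunks, target_chunks):
    """Fill in unmentioned variables with their source chunks (hypothetical helper).

    source_chunks: dict mapping every variable name to its current chunks.
    target_chunks: dict mapping a subset of variable names to requested chunks.
    """
    resolved = {}
    for name, chunks in source_chunks.items():
        # Variables absent from target_chunks pass through unchanged.
        resolved[name] = target_chunks.get(name, chunks)
    return resolved


# Assumed source chunking, chosen only for illustration.
source_chunks = {"a": (10, 50), "b": (10,), "c": (50,)}
target_chunks = {"a": (20, 10), "b": (20,)}

resolved = resolve_target_chunks(source_chunks, target_chunks)
```

Here `resolved["c"]` keeps the source chunks `(50,)`, while `a` and `b` take the requested values. An explicit `None` entry in target_chunks is left as-is by this sketch, since it only fills in names that are missing.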

cc @eric-czech, who wrote test_rechunk_dataset and so probably understands this part of the code best.