ContainsArrayError: path 'lon' contains an array
pl-marasco opened this issue · 13 comments
A little bit of context:
I have a dataset of 768 NetCDF files, each stored as a single unchunked file (time: 1, lat: 15680, lon: 40320).
What I'm trying to achieve:
- convert to the Zarr format, fully readable by xarray (I can't lose the coordinates).
- chunk the data so that the whole time axis sits in a single chunk (time: 768, lat: 512, lon: 512).
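To get a feel for what that target chunking implies, here is a rough back-of-the-envelope sketch. The shapes and chunk sizes come from the description above; the 4-byte item size is an assumption (the dtype is not stated anywhere in this thread), and `chunk_stats` is just an illustrative helper name.

```python
import math

# Grid described above: 768 time steps, each 15680 x 40320.
shape = {"time": 768, "lat": 15680, "lon": 40320}
# Target chunking from the post.
chunks = {"time": 768, "lat": 512, "lon": 512}

# ASSUMPTION: 4 bytes per value (e.g. float32); the thread does not state the dtype.
BYTES_PER_VALUE = 4


def chunk_stats(shape, chunks, itemsize):
    """Return (number of chunks, uncompressed bytes per full chunk)."""
    n_chunks = math.prod(math.ceil(shape[d] / chunks[d]) for d in shape)
    chunk_bytes = math.prod(chunks.values()) * itemsize
    return n_chunks, chunk_bytes


n_chunks, chunk_bytes = chunk_stats(shape, chunks, BYTES_PER_VALUE)
print(n_chunks, "chunks of about", round(chunk_bytes / 2**20), "MiB each")
```

Under that assumption a full chunk is several hundred MiB uncompressed, which is worth keeping in mind when choosing the memory limit passed to rechunk.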
Issue:
When I try to create the plan, I get this error: ContainsArrayError: path 'lon' contains an array
Gist:
https://gist.github.com/pl-marasco/f6e1bf9f3a0f87ce028fc68735ab25fa
IIRC, this typically happens when there are already files at target_store or temp_store.
Can you clear those directories prior to calling rechunk?
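One way to do that from Python, as a minimal sketch for directory-based stores on the local filesystem. `clear_store` is a hypothetical helper, not part of rechunker or Zarr, and the two paths are placeholders for whatever is passed to rechunk.

```python
import shutil
from pathlib import Path


def clear_store(path):
    """Delete a directory-based store if it exists; return True if something was removed."""
    p = Path(path)
    if p.exists():
        shutil.rmtree(p)
        return True
    return False


# Placeholder paths; substitute the target_store and temp_store given to rechunk().
for store in ("./output.zarr", "./temp_store.zarr"):
    clear_store(store)
```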
Unfortunately that isn't the case; the files are removed before calling rechunk.
Could you post the full traceback of your error?
---------------------------------------------------------------------------
ContainsArrayError Traceback (most recent call last)
<ipython-input-9-2e6f94a3cc43> in <module>
----> 1 array_plan = rechunk(ds, target_chunks, mem_max, target_store,temp_store=temp_store)
2
~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in rechunk(source, target_chunks, max_mem, target_store, target_options, temp_store, temp_options, executor)
294 )
295
--> 296 copy_spec, intermediate, target = _setup_rechunk(
297 source=source,
298 target_chunks=target_chunks,
~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _setup_rechunk(source, target_chunks, max_mem, target_store, target_options, temp_store, temp_options)
373 variable_attrs[DIMENSION_KEY] = encode_zarr_attr_value(variable.dims)
374
--> 375 copy_spec = _setup_array_rechunk(
376 dask.array.asarray(variable),
377 variable_chunks,
~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _setup_array_rechunk(source_array, target_chunks, max_mem, target_store_or_group, target_options, temp_store_or_group, temp_options, name)
493 write_chunks = tuple(int(x) for x in write_chunks)
494
--> 495 target_array = _zarr_empty(
496 shape,
497 target_store_or_group,
~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _zarr_empty(shape, store_or_group, chunks, dtype, name, **kwargs)
149 if name is not None:
150 assert isinstance(store_or_group, zarr.hierarchy.Group)
--> 151 return store_or_group.empty(
152 name, shape=shape, chunks=chunks, dtype=dtype, **kwargs
153 )
~\Anaconda3\envs\treotto_dev\lib\site-packages\zarr\hierarchy.py in empty(self, name, **kwargs)
899 """Create an array. Keyword arguments as per
900 :func:`zarr.creation.empty`."""
--> 901 return self._write_op(self._empty_nosync, name, **kwargs)
902
903 def _empty_nosync(self, name, **kwargs):
~\Anaconda3\envs\treotto_dev\lib\site-packages\zarr\hierarchy.py in _write_op(self, f, *args, **kwargs)
659
660 with lock:
--> 661 return f(*args, **kwargs)
662
663 def create_group(self, name, overwrite=False):
~\Anaconda3\envs\treotto_dev\lib\site-packages\zarr\hierarchy.py in _empty_nosync(self, name, **kwargs)
905 kwargs.setdefault('synchronizer', self._synchronizer)
906 kwargs.setdefault('cache_attrs', self.attrs.cache)
--> 907 return empty(store=self._store, path=path, chunk_store=self._chunk_store,
908 **kwargs)
909
~\Anaconda3\envs\treotto_dev\lib\site-packages\zarr\creation.py in empty(shape, **kwargs)
225
226 """
--> 227 return create(shape=shape, fill_value=None, **kwargs)
228
229
~\Anaconda3\envs\treotto_dev\lib\site-packages\zarr\creation.py in create(shape, chunks, dtype, compressor, fill_value, order, store, synchronizer, overwrite, path, chunk_store, filters, cache_metadata, cache_attrs, read_only, object_codec, **kwargs)
119
120 # initialize array metadata
--> 121 init_array(store, shape=shape, chunks=chunks, dtype=dtype, compressor=compressor,
122 fill_value=fill_value, order=order, overwrite=overwrite, path=path,
123 chunk_store=chunk_store, filters=filters, object_codec=object_codec)
~\Anaconda3\envs\treotto_dev\lib\site-packages\zarr\storage.py in init_array(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec)
342 _require_parent_group(path, store=store, chunk_store=chunk_store, overwrite=overwrite)
343
--> 344 _init_array_metadata(store, shape=shape, chunks=chunks, dtype=dtype,
345 compressor=compressor, fill_value=fill_value,
346 order=order, overwrite=overwrite, path=path,
~\Anaconda3\envs\treotto_dev\lib\site-packages\zarr\storage.py in _init_array_metadata(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec)
371 rmdir(chunk_store, path)
372 elif contains_array(store, path):
--> 373 raise ContainsArrayError(path)
374 elif contains_group(store, path):
375 raise ContainsGroupError(path)
ContainsArrayError: path 'lon' contains an array
I don't know if this helps, but from time to time (I still have to work out exactly when it happens) I get this error too:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-10-2e6f94a3cc43> in <module>
----> 1 array_plan = rechunk(ds, target_chunks, mem_max, target_store,temp_store=temp_store)
2
~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in rechunk(source, target_chunks, max_mem, target_store, target_options, temp_store, temp_options, executor)
294 )
295
--> 296 copy_spec, intermediate, target = _setup_rechunk(
297 source=source,
298 target_chunks=target_chunks,
~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _setup_rechunk(source, target_chunks, max_mem, target_store, target_options, temp_store, temp_options)
373 variable_attrs[DIMENSION_KEY] = encode_zarr_attr_value(variable.dims)
374
--> 375 copy_spec = _setup_array_rechunk(
376 dask.array.asarray(variable),
377 variable_chunks,
~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _setup_array_rechunk(source_array, target_chunks, max_mem, target_store_or_group, target_options, temp_store_or_group, temp_options, name)
464
465 if isinstance(target_chunks, dict):
--> 466 array_dims = _get_dims_from_zarr_array(source_array)
467 try:
468 target_chunks = _shape_dict_to_tuple(array_dims, target_chunks)
~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _get_dims_from_zarr_array(z_array)
138 # use Xarray convention
139 # http://xarray.pydata.org/en/stable/internals.html#zarr-encoding-specification
--> 140 return z_array.attrs["_ARRAY_DIMENSIONS"]
141
142
AttributeError: 'Array' object has no attribute 'attrs'
So the traceback definitely suggests that Zarr thinks there is already an array at the location of the target store. Just to completely rule this out, could you add something like this to your code just before calling rechunk:
import os
print(os.listdir(target_store))
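Note that os.listdir raises FileNotFoundError when the directory does not exist at all. A slightly more defensive variant of the same check, as a sketch (`list_store` is a hypothetical helper and the path is a placeholder):

```python
import os


def list_store(path):
    """Return the entries under `path`, or an empty list if the directory does not exist."""
    try:
        return sorted(os.listdir(path))
    except FileNotFoundError:
        return []


# Placeholder path; use the same value that is passed to rechunk() as target_store.
print(list_store("./output.zarr"))
```

An empty list here would mean there is nothing on disk at that location for a new array to collide with.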
The second error you posted should only be possible if you are rechunking from a Zarr array source (not an Xarray dataset). Does it arise from the same code you shared via gist above?
I'm confused by the fact that you are reporting two distinct errors in the same issue. For the same code, do you always get the same error? Or does it vary at random?
Yes, it is coming from the same code.
I posted the second one because the two errors appear (pardon the term) randomly.
I tested removing the attributes and adding this line to target_chunks:
'attrs': None
I still don't have a solution, and it jumps from one error to the other without any reason I can understand.
About whether the target store is empty:
print(os.listdir(target_store))
FileNotFoundError Traceback (most recent call last)
in
----> 1 print(os.listdir(target_store))
2
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'c:/data/tmp/NDVI_GLOBAL.zarr'
If you need some files for testing, you can download them from here:
https://land.copernicus.vgt.vito.be/manifest/ndvi_v2_1km/manifest_cgls_ndvi_v2_1km_latest.txt
I'm sorry for your frustration. This is extremely puzzling to me. In particular, the randomness / intermittency of the problem makes it very hard to debug.
Guessing has not worked, so what we will need to do is try to craft a minimal reproducible bug report which can reproduce the same errors, ideally without using your many TB of actual data, but rather with synthetic data that are small and simple.
Could you share the full output of print(ds)?
Let's try to solve the second problem first; after that I may be able to reproduce the first one more reliably.
Here is a more structured minimal reproducible bug report (MRBR):
ds = xr.tutorial.load_dataset("rasm")
target_chunks = {
'Tair': {'time': 36, 'lat': 50, 'lon': 50},
'time': None,
'lat': None,
'lon': None}
mem_max = '8GB'
target_store = './output.zarr'
temp_store = './temp_store.zarr'
! rm -rf ./*.zarr
array_plan = rechunk(ds, target_chunks, mem_max, target_store,temp_store=temp_store)
Traceback
in
----> 1 array_plan = rechunk(ds, target_chunks, mem_max, target_store,temp_store=temp_store)
~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in rechunk(source, target_chunks, max_mem, target_store, target_options, temp_store, temp_options, executor)
294 )
295
--> 296 copy_spec, intermediate, target = _setup_rechunk(
297 source=source,
298 target_chunks=target_chunks,
~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _setup_rechunk(source, target_chunks, max_mem, target_store, target_options, temp_store, temp_options)
373 variable_attrs[DIMENSION_KEY] = encode_zarr_attr_value(variable.dims)
374
--> 375 copy_spec = _setup_array_rechunk(
376 dask.array.asarray(variable),
377 variable_chunks,
~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _setup_array_rechunk(source_array, target_chunks, max_mem, target_store_or_group, target_options, temp_store_or_group, temp_options, name)
464
465 if isinstance(target_chunks, dict):
--> 466 array_dims = _get_dims_from_zarr_array(source_array)
467 try:
468 target_chunks = _shape_dict_to_tuple(array_dims, target_chunks)
~\Anaconda3\envs\treotto_dev\lib\site-packages\rechunker\api.py in _get_dims_from_zarr_array(z_array)
138 # use Xarray convention
139 # http://xarray.pydata.org/en/stable/internals.html#zarr-encoding-specification
--> 140 return z_array.attrs["_ARRAY_DIMENSIONS"]
141
142
AttributeError: 'Array' object has no attribute 'attrs'
I think the wrong assumption is that _ARRAY_DIMENSIONS is present: since this input isn't an Xarray dataset converted to Zarr, the attribute isn't defined and the call fails. I also tried converting to .zarr and reading it back in, but that doesn't seem to fix the problem.
Your example did not work for me, but in a different way:
import xarray as xr
from rechunker import rechunk
ds = xr.tutorial.load_dataset("rasm")
target_chunks = {
'Tair': {'time': 36, 'lat': 50, 'lon': 50},
'time': None,
'lat': None,
'lon': None}
mem_max = '8GB'
target_store = './output.zarr'
temp_store = './temp_store.zarr'
! rm -rf ./*.zarr
array_plan = rechunk(ds, target_chunks, mem_max, target_store,temp_store=temp_store)
I get KeyError: 'y'. The problem is that lon and lat are not dimensions on this dataset.
If I change target_chunks as follows:
target_chunks = {
'Tair': {'time': 36, 'y': 50, 'x': 50},
'time': None,
'lat': None,
'lon': None}
...then the example runs with no error in my dev environment.
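A mismatch like this can be caught before calling rechunk by comparing the keys of each dict-style chunk spec with the variable's actual dimension names. The following is only a sketch: `check_chunk_dims` is a hypothetical helper, and the dimension names for Tair are the ones reported in this thread.

```python
def check_chunk_dims(target_chunks, var_dims):
    """Return {variable: [unknown dimension names]} for dict-style chunk specs.

    target_chunks: mapping of variable name -> dict of chunk sizes, tuple, or None.
    var_dims: mapping of variable name -> tuple of that variable's dimension names.
    """
    problems = {}
    for name, spec in target_chunks.items():
        if not isinstance(spec, dict):
            continue  # None or tuple specs carry no dimension names to check
        unknown = [d for d in spec if d not in var_dims.get(name, ())]
        if unknown:
            problems[name] = unknown
    return problems


# Dimension names as reported in this thread for the "rasm" tutorial dataset.
# With xarray, this mapping could be built as {k: v.dims for k, v in ds.variables.items()}.
var_dims = {"Tair": ("time", "y", "x")}

bad = {"Tair": {"time": 36, "lat": 50, "lon": 50}, "time": None}
good = {"Tair": {"time": 36, "y": 50, "x": 50}, "time": None}
print(check_chunk_dims(bad, var_dims))   # names that are not dimensions of Tair
print(check_chunk_dims(good, var_dims))  # empty dict: every name matches
```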
So somehow, in your environment, it does not realize that the input is an xarray dataset.
Can you share your rechunker version?
import rechunker
rechunker.__version__
OK, I have determined that this is a version of the bug in #59 (comment). It was fixed in #72. Basically, in your version, you can only specify chunks as a tuple, i.e. 'Tair': (36, 50, 50), not as a dict, i.e. 'Tair': {'time': 36, 'y': 50, 'x': 50}.
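For anyone stuck on an affected version, a dict-style spec can be turned into the tuple form by ordering the sizes to match the variable's dimensions. This is a sketch only: `chunks_as_tuple` is a hypothetical helper, and the (time, y, x) order is assumed from the tuple and dict forms quoted above.

```python
def chunks_as_tuple(dims, chunk_dict):
    """Order per-dimension chunk sizes to match the variable's dimension order."""
    return tuple(chunk_dict[d] for d in dims)


# ASSUMPTION: Tair's dimensions are ordered (time, y, x), as implied by the
# tuple and dict forms quoted above. With xarray, ds["Tair"].dims gives the real order.
tair_dims = ("time", "y", "x")
print(chunks_as_tuple(tair_dims, {"x": 50, "time": 36, "y": 50}))
```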
I just released v0.3.3 to PyPI, so you could try upgrading to see if that fixes your problem.
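A quick way to check whether an environment already has that release, sketched with the standard library only. `version_tuple` and `has_fix` are hypothetical helpers, and the comparison is deliberately crude: it is only meant for plain release versions such as 0.3.3.

```python
from importlib import metadata


def version_tuple(version):
    """Turn '0.3.3' into (0, 3, 3); non-numeric parts such as 'dev0' are ignored."""
    return tuple(int(p) for p in version.split(".") if p.isdigit())


def has_fix(installed, fixed_in="0.3.3"):
    """True if `installed` is at least the release mentioned above."""
    return version_tuple(installed) >= version_tuple(fixed_in)


try:
    installed = metadata.version("rechunker")
    print(installed, "is at least 0.3.3:", has_fix(installed))
except metadata.PackageNotFoundError:
    print("rechunker is not installed in this environment")
```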
Solved! It also works with the original dataset I was using.
Now I'm able to create the plan.
Thanks