Rechunking does not produce .zmetadata
trondactea opened this issue · 7 comments
First, thanks for this great toolbox!
I have to rechunk an existing global zarr dataset (the GLORYS ocean model) with existing chunks (1, 50, 2041, 4320) (time, depth, lat, lon). Using this global dataset, I frequently extract regional domains that are typically 10x10 degrees lat-lon in size. I thought quicker read access would be achieved if I rechunked to (324, 50, 100, 100).
The conversion went well with rechunker, but when trying to read the dataset using `xarray.open_zarr` it fails due to a missing `.zmetadata`. The original `.zarr` dataset has consolidated metadata available.

Is there an option to create the metadata, or is my approach wrong here? My code for converting the existing zarr dataset is below. I appreciate any help here!
Thanks,
Trond
```python
import xarray as xr
import gcsfs
import dask.array as dsa
from rechunker import rechunk

fs = gcsfs.GCSFileSystem(token='google_default')

for var_name in ["thetao"]:
    zarr_url = f"gs://shared/zarr/copernicus/{var_name}"
    mapper = fs.get_mapper(zarr_url)
    source_array = xr.open_zarr(mapper, consolidated=True)
    print(source_array.chunks)

    max_mem = '1GB'
    target_chunks = {'time': 324, 'latitude': 100, 'longitude': 100}

    # you must have write access to this location
    store_tmp = fs.get_mapper('gs://shared/zarr/temp.zarr')
    store_target = fs.get_mapper('gs://shared/zarr/target.zarr')

    r = rechunk(source_array, target_chunks, max_mem, store_target,
                temp_store=store_tmp)
    result = r.execute()
    dsa.from_zarr(result)
```
I think you need to actually consolidate the metadata in a separate step. See here
> The conversion went well with rechunker but when trying to read the dataset using `xarray.open_zarr` it fails due to missing `.zmetadata`
Can you share the full error traceback you obtained?
The full traceback is shown below when I try to run:
```python
for var_name in ["thetao"]:
    zarr_url = f"gs://shared/zarr/target.zarr/{var_name}"
    mapper = fs.get_mapper(zarr_url)
    ds = xr.open_zarr(mapper, consolidated=True)
```
Traceback:

```
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jovyan/datasets/run_zarr_tests.py", line 32, in <module>
    ds = xr.open_zarr(mapper, consolidated=True)
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/zarr.py", line 768, in open_zarr
    ds = open_dataset(
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/api.py", line 495, in open_dataset
    backend_ds = backend.open_dataset(
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/zarr.py", line 824, in open_dataset
    store = ZarrStore.open_group(
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/zarr.py", line 384, in open_group
    zarr_group = zarr.open_consolidated(store, **open_kwargs)
  File "/opt/conda/lib/python3.9/site-packages/zarr/convenience.py", line 1183, in open_consolidated
    meta_store = ConsolidatedMetadataStore(store, metadata_key=metadata_key)
  File "/opt/conda/lib/python3.9/site-packages/zarr/storage.py", line 2590, in __init__
    meta = json_loads(store[metadata_key])
  File "/opt/conda/lib/python3.9/site-packages/fsspec/mapping.py", line 139, in __getitem__
    raise KeyError(key)
KeyError: '.zmetadata'
```
Ah ok, so your options are

```python
ds = xr.open_zarr(mapper, consolidated=False)
```

or

```python
from zarr.convenience import consolidate_metadata

consolidate_metadata(mapper)
ds = xr.open_zarr(mapper, consolidated=True)
```
Are you suggesting that we should automatically consolidate the target within rechunker?
I thought that the availability of `.zmetadata` for large datasets speeds up performance. If I can create the metadata using the `zarr` function, that works well, of course. For me, automatic creation of `.zmetadata` would be very useful, but I don't have deep experience with `zarr`. Thanks for your help.
> I thought that the availability of `.zmetadata` for large datasets speeds up performance.
It can speed up the process of initializing the dataset itself (`xr.open_zarr`) if the underlying store (GCS in this case) is slow to list. There is no performance impact after that.