pangeo-data/rechunker

Rechunking does not produce .zmetadata

trondactea opened this issue · 7 comments

First, thanks for this great toolbox!

I need to rechunk an existing global zarr dataset (the GLORYS ocean model) whose current chunks are (1, 50, 2041, 4320) (time, depth, lat, lon). From this global dataset I frequently extract regional domains that are typically 10x10 degrees in lat-lon, so I expected quicker read access if I rechunked to (324, 50, 100, 100).

The conversion went well with rechunker but when trying to read the dataset using xarray.open_zarr it fails due to missing .zmetadata. The original .zarr dataset has consolidated metadata available.

Is there an option to create the metadata, or is my approach wrong? My code for converting the existing zarr dataset is below. Any help is appreciated!

Thanks,
Trond

import xarray as xr
import gcsfs
import dask.array as dsa
from rechunker import rechunk

fs = gcsfs.GCSFileSystem(token='google_default')

for var_name in ["thetao"]:
    zarr_url = f"gs://shared/zarr/copernicus/{var_name}"
    mapper = fs.get_mapper(zarr_url)

    # open the source dataset (chunks are time=1, depth=50, lat=2041, lon=4320)
    source_array = xr.open_zarr(mapper, consolidated=True)
    print(source_array.chunks)

    max_mem = '1GB'
    target_chunks = {'time': 324, 'latitude': 100, 'longitude': 100}

    # you must have write access to these locations
    store_tmp = fs.get_mapper('gs://shared/zarr/temp.zarr')
    store_target = fs.get_mapper('gs://shared/zarr/target.zarr')

    # build the rechunking plan and execute it
    r = rechunk(source_array, target_chunks, max_mem, store_target, temp_store=store_tmp)
    result = r.execute()
    dsa.from_zarr(result)

I think you need to actually consolidate the metadata in a separate step. See here
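
For example, a minimal sketch of that separate step (assuming the store_target mapper from the code above):

import zarr

# write a consolidated .zmetadata key into the rechunked target store
zarr.consolidate_metadata(store_target)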

The conversion went well with rechunker but when trying to read the dataset using xarray.open_zarr it fails due to missing .zmetadata

Can you share the full error traceback you obtained?

The full traceback is shown below when I try to run:

for var_name in ["thetao"]:
    zarr_url = f"gs://shared/zarr/target.zarr/{var_name}"
    mapper = fs.get_mapper(zarr_url)
    ds = xr.open_zarr(mapper, consolidated=True)

Traceback:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jovyan/datasets/run_zarr_tests.py", line 32, in <module>
    ds = xr.open_zarr(mapper, consolidated=True)
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/zarr.py", line 768, in open_zarr
    ds = open_dataset(
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/api.py", line 495, in open_dataset
    backend_ds = backend.open_dataset(
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/zarr.py", line 824, in open_dataset
    store = ZarrStore.open_group(
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/zarr.py", line 384, in open_group
    zarr_group = zarr.open_consolidated(store, **open_kwargs)
  File "/opt/conda/lib/python3.9/site-packages/zarr/convenience.py", line 1183, in open_consolidated
    meta_store = ConsolidatedMetadataStore(store, metadata_key=metadata_key)
  File "/opt/conda/lib/python3.9/site-packages/zarr/storage.py", line 2590, in __init__
    meta = json_loads(store[metadata_key])
  File "/opt/conda/lib/python3.9/site-packages/fsspec/mapping.py", line 139, in __getitem__
    raise KeyError(key)
KeyError: '.zmetadata'

Ah ok, so your options are

ds = xr.open_zarr(mapper, consolidated=False)

or

from zarr.convenience import consolidate_metadata
consolidate_metadata(mapper)
ds = xr.open_zarr(mapper, consolidated=True)
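
In the rechunker workflow from the first post, the second option could be applied to the target store right after execution (a sketch, reusing the store_target mapper defined there):

from zarr.convenience import consolidate_metadata

result = r.execute()                 # run the rechunking plan
consolidate_metadata(store_target)   # writes .zmetadata into the target store
ds = xr.open_zarr(store_target, consolidated=True)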

Are you suggesting that we should automatically consolidate the target within rechunker?

I thought that the availability of .zmetadata for large datasets speeds up performance. If I can create the metadata with the zarr function, that of course works well. For me, automatic creation of .zmetadata would be very useful, but I don't have deep experience with zarr. Thanks for your help.

I thought that the availability of .zmetadata for large datasets speeds up performance.

It can speed up the process of initializing the dataset itself (xr.open_zarr) if the underlying store (GCS in this case) is slow to list. There is no performance impact after that.
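
One rough way to see the difference on a given store (an illustrative sketch, using the store_target mapper from above; actual timings depend entirely on the store):

import time
import xarray as xr

for consolidated in (True, False):
    t0 = time.perf_counter()
    xr.open_zarr(store_target, consolidated=consolidated)
    print(f"consolidated={consolidated}: {time.perf_counter() - t0:.2f} s")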

That makes sense. Thanks, @rabernat and @jbusecke.