pangeo-data/rechunker

Regression with Zarr: ReadOnlyError

rabernat opened this issue · 14 comments

Tests with the latest dev environment are failing with errors like this:


tmp_path = PosixPath('/private/var/folders/kl/7rfdrpx96bb0rhbnl5l2dnkw0000gn/T/pytest-of-rabernat/pytest-69/test_rechunk_group_mapper_temp7')
executor = 'python', source_store = 'mapper.source.zarr', target_store = <fsspec.mapping.FSMap object at 0x1174e3520>
temp_store = <fsspec.mapping.FSMap object at 0x1174e3400>

    @pytest.mark.parametrize(
        "executor",
        [
            "dask",
            "python",
            requires_beam("beam"),
            requires_prefect("prefect"),
        ],
    )
    @pytest.mark.parametrize("source_store", ["source.zarr", "mapper.source.zarr"])
    @pytest.mark.parametrize("target_store", ["target.zarr", "mapper.target.zarr"])
    @pytest.mark.parametrize("temp_store", ["temp.zarr", "mapper.temp.zarr"])
    def test_rechunk_group(tmp_path, executor, source_store, target_store, temp_store):
        if source_store.startswith("mapper"):
            fsspec = pytest.importorskip("fsspec")
            store_source = fsspec.get_mapper(str(tmp_path) + source_store)
            target_store = fsspec.get_mapper(str(tmp_path) + target_store)
            temp_store = fsspec.get_mapper(str(tmp_path) + temp_store)
        else:
            store_source = str(tmp_path / source_store)
            target_store = str(tmp_path / target_store)
            temp_store = str(tmp_path / temp_store)
    
>       group = zarr.group(store_source)

tests/test_rechunk.py:457: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../mambaforge/envs/rechunker/lib/python3.9/site-packages/zarr/hierarchy.py:1355: in group
    init_group(store, overwrite=overwrite, chunk_store=chunk_store,
../../../mambaforge/envs/rechunker/lib/python3.9/site-packages/zarr/storage.py:648: in init_group
    _init_group_metadata(store=store, overwrite=overwrite, path=path,
../../../mambaforge/envs/rechunker/lib/python3.9/site-packages/zarr/storage.py:711: in _init_group_metadata
    store[key] = store._metadata_class.encode_group_metadata(meta)  # type: ignore
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <zarr.storage.FSStore object at 0x1174e34c0>, key = '.zgroup', value = b'{\n    "zarr_format": 2\n}'

    def __setitem__(self, key, value):
        if self.mode == 'r':
>           raise ReadOnlyError()
E           zarr.errors.ReadOnlyError: object is read-only

../../../mambaforge/envs/rechunker/lib/python3.9/site-packages/zarr/storage.py:1410: ReadOnlyError

This is the cause of the test failures in #134.

Shoot, I'm still getting the read_only errors with 0.5.1:
https://nbviewer.org/gist/85a34aed6e432d0d8502841076bbab92

I think you may be hitting a version of zarr-developers/zarr-python#1353 because you are calling

m = fs.get_mapper("")

Try updating to the latest zarr version, or else creating an FSStore instead.
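
For illustration, a sketch of the FSStore route (the local path is a placeholder for your actual target location):

import zarr
from zarr.storage import FSStore

# Open the store through zarr's FSStore in write mode instead of an
# fsspec mapper returned by fs.get_mapper(""). The path is a placeholder.
store = FSStore("/tmp/example-target.zarr", mode="w")
group = zarr.group(store, overwrite=True)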

Okay, will do!

Would be helpful to confirm which Zarr version you had installed.

Hmm, zarr=2.13.6, the latest from conda-forge. I see that zarr=2.14.2 has been released though. I'll try pip installing that.

Okay, with the latest zarr=2.14.2, I don't get the read_only errors.

But the workflow fails near the end of the rechunking process:


KilledWorker: Attempted to run task ('copy_intermediate_to_write-bca90f45d4dc080cca14b54ce5a10d1f', 2) on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tls://10.10.105.181:35291. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.

The logs from those workers are not available on the dashboard, I guess because the workers died, right?

This rechunker workflow was working in December. Should I revert to zarr and rechunker from that era?

Ideally you would figure out what is going wrong and help us fix it, rather than rolling back to an earlier version. After all, you're a rechunker maintainer now! 😉

Are you sure that all your package versions match on your workers?
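
For what it's worth, distributed can check that directly (a sketch; the scheduler address is a placeholder for however you normally connect):

from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address
# Compares package versions reported by the client, scheduler, and workers;
# with check=True a mismatch raises an error instead of just being listed.
client.get_versions(check=True)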

I'm certainly willing to try to help debug it, but don't really know where to start. If you have ideas, I'm game to try them.

One of the nice things about nebari/conda-store is that the notebook and workers see the same environment (accessed from the conda-store pod), so the versions always match.

I added you to the ESIP Nebari deployment if you are interested in checking it out.

https://nebari.esipfed.org/hub/user-redirect/lab/tree/shared/users/Welcome.ipynb

https://nebari.esipfed.org/hub/user-redirect/lab/tree/shared/users/rsignell/notebooks/NWM/rechunk_grid/03_rechunk.ipynb

I won't be able to log into the ESIP cluster to debug your failing computation. If you think there has been a regression in rechunker in the new release, I strongly encourage you to develop a minimum reproducible example and share it via the issue tracker.
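
For what it's worth, a reproducer shaped like the failing test above would be the most useful (a sketch only; shapes, chunks, and the local paths are placeholders):

import fsspec
import zarr
from rechunker import rechunk

# Build a small source group using mapper-based stores, as in the failing test.
source_store = fsspec.get_mapper("/tmp/source.zarr")
target_store = fsspec.get_mapper("/tmp/target.zarr")
temp_store = fsspec.get_mapper("/tmp/temp.zarr")

group = zarr.group(source_store, overwrite=True)
a = group.create_dataset("a", shape=(100, 100), chunks=(100, 1), dtype="f8")
a[:] = 1.0

# Use the python executor to take dask out of the picture entirely.
plan = rechunk(
    group,
    target_chunks={"a": (1, 100)},
    max_mem="100MB",
    target_store=target_store,
    temp_store=temp_store,
    executor="python",
)
plan.execute()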

If you have ideas, I'm game to try them.

My first idea would be to freeze every package version except rechunker in your environment, and then try running the exact same workflow with only the rechunker version changing (say 0.5.0 vs 0.5.1). Your example has a million moving pieces: Dask, Zarr, kerchunk, xarray, etc. It's impossible to say whether your problem is caused by a change in rechunker unless you can isolate it. There have been extremely few changes to rechunker over the past year, and nothing that would obviously cause your dask workers to start running out of memory.
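
As a starting point, it helps to record the exact versions in play so the only difference between the two runs is rechunker itself (a sketch; adjust the package list as needed):

import importlib.metadata as md

# Print the versions actually installed in the environment the workers use,
# so the two comparison runs differ only in the rechunker version.
for pkg in ["rechunker", "zarr", "dask", "distributed", "fsspec", "xarray", "numcodecs"]:
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")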

I've confirmed that my rechunking workflow runs successfully if I pin zarr=2.13.3:

cf_xarray                 0.8.0              pyhd8ed1ab_0    conda-forge
dask                      2023.3.1           pyhd8ed1ab_0    conda-forge
dask-core                 2023.3.1           pyhd8ed1ab_0    conda-forge
dask-gateway              2022.4.0           pyh8af1aa0_0    conda-forge
dask-geopandas            0.3.0              pyhd8ed1ab_0    conda-forge
dask-image                2022.9.0           pyhd8ed1ab_0    conda-forge
fsspec                    2023.3.0+5.gbac7529          pypi_0    pypi
intake-xarray             0.6.1              pyhd8ed1ab_0    conda-forge
jupyter_server_xarray_leaflet 0.2.3              pyhd8ed1ab_0    conda-forge
numcodecs                 0.11.0          py310heca2aa9_1    conda-forge
pint-xarray               0.3                pyhd8ed1ab_0    conda-forge
rechunker                 0.5.1                    pypi_0    pypi
rioxarray                 0.13.4             pyhd8ed1ab_0    conda-forge
s3fs                      2022.11.0       py310h06a4308_0  
xarray                    2023.2.0           pyhd8ed1ab_0    conda-forge
xarray-datatree           0.0.12             pyhd8ed1ab_0    conda-forge
xarray-spatial            0.3.5              pyhd8ed1ab_0    conda-forge
xarray_leaflet            0.2.3              pyhd8ed1ab_0    conda-forge
zarr                      2.13.3             pyhd8ed1ab_0    conda-forge
  • If I change to zarr=2.13.6 I get the ReadOnlyError: object is read-only error.
  • If I change to zarr=2.14.2 I get the dask workers dying.

@gzt5142 has a minimal reproducible example he will post shortly. But should this be raised as a zarr issue?

Thanks a lot for looking into this, Rich!

But should this be raised as a zarr issue?

How minimal is it? Can you decouple it from the dask and rechunker issues? Can you say more about what you think the root problem is?

Unfortunately, it turns out the minimal example we created works fine; it does not trigger the problem described here. :(

I'm going to reopen this issue.

If there is a bug somewhere in our stack that is preventing rechunker from working properly, we really need to get to the bottom of it.