pydata/xarray

xarray can't append to Zarrs with byte-string variables

What happened:

I tried to use xarray to append a Dataset to a Zarr containing a |S1 (char string) datatype, and received this error:

ValueError: Invalid dtype for data variable: <xarray.DataArray 'x' ()> array(b'', dtype='|S1') dtype must be a subtype of number, datetime, bool, a fixed sized string, a fixed size unicode string or an object

What you expected to happen:

I expected the Dataset to be appended to the Zarr.

Minimal Complete Verifiable Example:

Note: this is not quite "minimal", since it also performs the append using the Zarr library directly and using a |U1 (Unicode) datatype in order to demonstrate that these variations work.

import numpy as np
import xarray as xr
import zarr

def test_append(data_type, zarr_path):
    print(f"Creating {data_type} Zarr...")
    ds = xr.Dataset({"x": np.array("", dtype=data_type)})
    ds.to_zarr(zarr_path, mode="w")

    print(f"Appending to {data_type} Zarr with Zarr library...")
    zarr_to_append = zarr.open(zarr_path, mode="a")
    zarr_to_append.x.append(np.array("", dtype=data_type))

    print(f"Appending to {data_type} Zarr with xarray...")
    ds_to_append = xr.Dataset({"x": np.array("", dtype=data_type)})
    ds_to_append.to_zarr(zarr_path, mode="a")

test_append("|U1", "test-u.zarr")
test_append("|S1", "test-s.zarr")

Anything else we need to know?:

I came across this problem when converting some NetCDFs from this dataset to a Zarr, appending them along the time axis. The latest data format version (1.4) includes a dimensionless variable crs with type char, which xarray reads as an |S1, causing the error described above when I attempt to append. Replacing crs with a |U1-typed variable works around the problem, but is undesirable since we need to reproduce the NetCDFs as closely as possible. The example above shows that the Zarr format and library themselves don't seem to have a problem with appending byte-string variables.
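For reference, the |U1 workaround mentioned above can be sketched roughly as follows. The helper name bytes_vars_to_unicode is hypothetical (it is not part of xarray), and the sketch only touches data variables, not coordinates. It changes the stored dtype, which is exactly why it is not an acceptable solution when the NetCDFs have to be reproduced faithfully.

```python
import numpy as np
import xarray as xr


def bytes_vars_to_unicode(ds: xr.Dataset) -> xr.Dataset:
    """Return a copy of ds whose fixed-size byte-string data variables
    (NumPy kind "S") are cast to fixed-size unicode (kind "U").

    Hypothetical helper for illustration only; not an xarray API.
    """
    converted = ds.copy()
    for name, da in ds.data_vars.items():
        if da.dtype.kind == "S":
            # For "S" dtypes, itemsize is the width in bytes (1 for |S1).
            width = max(da.dtype.itemsize, 1)
            converted[name] = da.astype(f"U{width}")
    return converted


# A dataset with one byte-string variable, similar to the crs case.
ds = xr.Dataset({"crs": np.array(b"", dtype="|S1")})
ds_unicode = bytes_vars_to_unicode(ds)
print(ds["crs"].dtype, "->", ds_unicode["crs"].dtype)
```

The converted dataset passes the append validation because the variable is now a fixed-size unicode string, but the on-disk type no longer matches the source file.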

The obvious fix would be to loosen the type check in xarray.backends.api._validate_datatypes_for_zarr_append:

        if (
            not np.issubdtype(var.dtype, np.number)
            and not np.issubdtype(var.dtype, np.datetime64)
            and not np.issubdtype(var.dtype, np.bool_)
            and not coding.strings.is_unicode_dtype(var.dtype)
            and not coding.strings.is_bytes_dtype(var.dtype) # <- this line added to avoid "Invalid dtype" error
            and not var.dtype == object
        ):

This change makes the example above work, but I don't know whether it would have any unintended side effects.
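To make the effect of the extra condition easier to see in isolation, here is a NumPy-only sketch of the loosened check. The function is_appendable_dtype is a hypothetical stand-in, not xarray's code: the two kind comparisons are assumed approximations of coding.strings.is_unicode_dtype and coding.strings.is_bytes_dtype, and they ignore any variable-length string handling those helpers may do.

```python
import numpy as np


def is_appendable_dtype(dtype) -> bool:
    """Hypothetical stand-in for the loosened dtype check shown above.

    Returns True when the validation would accept the dtype.
    """
    dtype = np.dtype(dtype)
    return bool(
        np.issubdtype(dtype, np.number)
        or np.issubdtype(dtype, np.datetime64)
        or np.issubdtype(dtype, np.bool_)
        or dtype.kind == "U"  # assumed approximation of is_unicode_dtype
        or dtype.kind == "S"  # assumed approximation of is_bytes_dtype (the added condition)
        or dtype == object
    )


for candidate in ["|U1", "|S1", "float64", "datetime64[ns]", "V4"]:
    print(candidate, is_appendable_dtype(candidate))
```

Without the kind == "S" condition, |S1 matches none of the remaining branches, which is consistent with the "Invalid dtype" error reported above.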

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: 0021cda
python: 3.8.5 (default, Jan 27 2021, 15:41:15)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.8.0-50-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.7.3

xarray: 0.15.0
pandas: 0.25.3
numpy: 1.17.4
scipy: 1.3.3
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.7.1
h5py: 2.10.0
Nio: None
zarr: 2.4.0+ds
cftime: 1.1.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.8.1+dfsg
distributed: None
matplotlib: 3.1.2
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 45.2.0
pip: 20.0.2
conda: None
pytest: 4.6.9
IPython: 7.13.0
sphinx: 1.8.5

Could you kindly share the Python traceback from the error message?

Sorry, that was remiss of me. Here's the output from the test script above, including the traceback.

$ python test_append_bytes.py 
Creating |U1 Zarr...
Appending to |U1 Zarr with Zarr library...
Appending to |U1 Zarr with xarray...
Creating |S1 Zarr...
Appending to |S1 Zarr with Zarr library...
Appending to |S1 Zarr with xarray...
Traceback (most recent call last):
  File "/home/pont/test_append_bytes.py", line 21, in <module>
    test_append("|S1", "test-s.zarr")
  File "/home/pont/test_append_bytes.py", line 18, in test_append
    ds_to_append.to_zarr(zarr_path, mode="a")
  File "/home/pont/loc/repos/xarray/xarray/core/dataset.py", line 1877, in to_zarr
    return to_zarr(
  File "/home/pont/loc/repos/xarray/xarray/backends/api.py", line 1414, in to_zarr
    _validate_datatypes_for_zarr_append(dataset)
  File "/home/pont/loc/repos/xarray/xarray/backends/api.py", line 1261, in _validate_datatypes_for_zarr_append
    check_dtype(k)
  File "/home/pont/loc/repos/xarray/xarray/backends/api.py", line 1252, in check_dtype
    raise ValueError(
ValueError: Invalid dtype for data variable: <xarray.DataArray 'x' ()>
array(b'', dtype='|S1') dtype must be a subtype of number, datetime, bool, a fixed sized string, a fixed size unicode string or an object