Shuffling with categorical data raises `AttributeError: 'ArrowStringArray' object has no attribute 'categories'`
hendrikmakait opened this issue · 0 comments
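Describe the issue:
Shuffling a DataFrame that contains a categorical column fails with `AttributeError: 'ArrowStringArray' object has no attribute 'categories'`.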
Minimal Complete Verifiable Example:
import dask.dataframe as dd

df = dd.from_dict(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [
            "x",
            "y",
            "x",
            "y",
            "z",
        ],
    },
    npartitions=2,
)
df.b = df.b.astype("category")
res = df.shuffle("a").compute()
raises
Traceback (most recent call last):
  File "/Users/hendrikmakait/projects/dask/dask-expr/reproducer.py", line 16, in <module>
    res = df.shuffle("a").compute()
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hendrikmakait/projects/dask/dask-expr/dask_expr/_collection.py", line 476, in compute
    return DaskMethodsMixin.compute(out, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/dask-expr/lib/python3.12/site-packages/dask/base.py", line 375, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/dask-expr/lib/python3.12/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/dask-expr/lib/python3.12/site-packages/dask/dataframe/dispatch.py", line 68, in concat
    return func(
           ^^^^^
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/dask-expr/lib/python3.12/site-packages/dask/dataframe/backends.py", line 676, in concat_pandas
    out[col] = union_categoricals(parts, ignore_order=ignore_order)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/dask-expr/lib/python3.12/site-packages/pandas/core/dtypes/concat.py", line 304, in union_categoricals
    if not lib.dtypes_all_equal([obj.categories.dtype for obj in to_union]):
                                 ^^^^^^^^^^^^^^
AttributeError: 'ArrowStringArray' object has no attribute 'categories'
(dask-expr)
FWIW, it doesn't matter whether I shuffle on `a` or `b`.
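
The last frame of the traceback shows `union_categoricals` being handed at least one array that is not categorical. Below is a minimal pandas-only sketch of that failure mode. It assumes (not confirmed here) that one of the shuffled partitions no longer carries the categorical dtype when the parts are concatenated. The `plain_part` Series is a hypothetical stand-in that uses object dtype instead of Arrow-backed strings, so the sketch runs without pyarrow; the error message will therefore name a different array type than the one in the traceback.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# One part is categorical, as after `astype("category")`.
categorical_part = pd.Series(["x", "y", "x"], dtype="category")

# Hypothetical stand-in for a partition whose column is no longer categorical.
plain_part = pd.Series(["y", "z"], dtype=object)

error = None
try:
    union_categoricals([categorical_part, plain_part])
except (AttributeError, TypeError) as exc:
    # The plain array has no `.categories`, so the union cannot be built.
    error = exc
    print(type(exc).__name__, exc)

# For contrast: once every part is categorical, the union succeeds.
combined = union_categoricals([categorical_part, plain_part.astype("category")])
print(list(combined.categories))
```

If this is indeed what happens, the fix would presumably be to make sure every partition keeps (or regains) the categorical dtype before `concat_pandas` calls `union_categoricals`, but I haven't verified where the dtype gets lost.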