BUG: grouping with categorical interval columns
Closed this issue · 11 comments
Versions:
pandas 1.0.3
numpy 1.18.1
There is a bug in the pandas 1.x release (observed on 1.0.3, see versions above): grouping by a categorical interval column, such as the result of pd.qcut, together with another column raises a TypeError.
import numpy as np
import pandas as pd
pd.set_option("use_inf_as_na",True)
t = pd.DataFrame({"x":np.random.randn(100), 'w':np.random.choice(list("ABC"), 100)})
qq = pd.qcut(t['x'], q=np.linspace(0,1,5))
This works and gives the expected result:
t.groupby([qq])['x'].agg('mean')
x
(-10.001, -1.0]   -1.431893
(-1.0, 0.0]       -0.423564
(0.0, 1.0]         0.461174
(1.0, 10.0]        1.662297
Name: x, dtype: float64
This raises a TypeError:
t.groupby([qq,'w'])['x'].agg('mean')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-43-6d7782f17653> in <module>
----> 1 t.groupby([qq,'w'])['x'].agg('mean')
~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
245
246 if isinstance(func, str):
--> 247 return getattr(self, func)(*args, **kwargs)
248
249 elif isinstance(func, abc.Iterable):
~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in mean(self, *args, **kwargs)
1223 nv.validate_groupby_func("mean", args, kwargs, ["numeric_only"])
1224 return self._cython_agg_general(
-> 1225 "mean", alt=lambda x, axis: Series(x).mean(**kwargs), **kwargs
1226 )
1227
~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
907 raise DataError("No numeric types to aggregate")
908
--> 909 return self._wrap_aggregated_output(output)
910
911 def _python_agg_general(self, func, *args, **kwargs):
~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _wrap_aggregated_output(self, output)
384 output=output, index=self.grouper.result_index
385 )
--> 386 return self._reindex_output(result)._convert(datetime=True)
387
388 def _wrap_transformed_output(
~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _reindex_output(self, output, fill_value)
2481 levels_list = [ping.group_index for ping in groupings]
2482 index, _ = MultiIndex.from_product(
-> 2483 levels_list, names=self.grouper.names
2484 ).sortlevel()
2485
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in from_product(cls, iterables, sortorder, names)
551
552 codes = cartesian_product(codes)
--> 553 return MultiIndex(levels, codes, sortorder=sortorder, names=names)
554
555 @classmethod
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in __new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity, _set_identity)
278
279 if verify_integrity:
--> 280 new_codes = result._verify_integrity()
281 result._codes = new_codes
282
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _verify_integrity(self, codes, levels)
366
367 codes = [
--> 368 self._validate_codes(level, code) for level, code in zip(levels, codes)
369 ]
370 new_codes = FrozenList(codes)
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in <listcomp>(.0)
366
367 codes = [
--> 368 self._validate_codes(level, code) for level, code in zip(levels, codes)
369 ]
370 new_codes = FrozenList(codes)
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _validate_codes(self, level, code)
302 to a level with missing values (NaN, NaT, None).
303 """
--> 304 null_mask = isna(level)
305 if np.any(null_mask):
306 code = np.where(null_mask[code], -1, code)
~/miniconda3/lib/python3.7/site-packages/pandas/core/dtypes/missing.py in isna(obj)
124 Name: 1, dtype: bool
125 """
--> 126 return _isna(obj)
127
128
~/miniconda3/lib/python3.7/site-packages/pandas/core/dtypes/missing.py in _isna_old(obj)
181 return False
182 elif isinstance(obj, (ABCSeries, np.ndarray, ABCIndexClass, ABCExtensionArray)):
--> 183 return _isna_ndarraylike_old(obj)
184 elif isinstance(obj, ABCGeneric):
185 return obj._constructor(obj._data.isna(func=_isna_old))
~/miniconda3/lib/python3.7/site-packages/pandas/core/dtypes/missing.py in _isna_ndarraylike_old(obj)
281 else:
282 result = np.empty(shape, dtype=bool)
--> 283 vec = libmissing.isnaobj_old(values.ravel())
284 result[:] = vec.reshape(shape)
285
TypeError: Argument 'arr' has incorrect type (expected numpy.ndarray, got Categorical)
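The traceback ends in _isna_old, the code path pandas switches to when use_inf_as_na is enabled. A possible workaround, sketched here and not verified against 1.0.3, is to leave the option off and mark infinities as missing explicitly before grouping. The seeded generator, the injected np.inf value, and the explicit observed=False are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Seeded data so the sketch is repeatable (the seed is an arbitrary choice).
rng = np.random.default_rng(0)
t = pd.DataFrame({
    "x": rng.standard_normal(100),
    "w": rng.choice(list("ABC"), 100),
})
t.loc[0, "x"] = np.inf  # illustrative infinite value

# Instead of pd.set_option("use_inf_as_na", True), convert infinities
# to NaN explicitly so the global option is never switched on.
t["x"] = t["x"].replace([np.inf, -np.inf], np.nan)

qq = pd.qcut(t["x"], q=np.linspace(0, 1, 5))
means = t.groupby([qq, "w"], observed=False)["x"].agg("mean")
print(means)
```

Rows whose x is NaN get no quantile bin and are dropped from the grouping, which matches how missing keys are normally treated by groupby.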
This seems to work on master for me:
[ins] In [1]: import numpy as np
[ins] In [2]: import pandas as pd
[ins] In [3]: pd.set_option("use_inf_as_na",True)
[ins] In [4]: t = pd.DataFrame({"x":np.random.randn(100), 'w':np.random.choice(list("ABC"), 100)})
[ins] In [5]: qq = pd.qcut(t['x'], q=np.linspace(0,1,5))
[ins] In [6]: t.groupby([qq,'w'])['x'].agg('mean')
Out[6]:
x w
(-2.7649999999999997, -0.736] A -1.412247
B -1.319972
C -1.108550
(-0.736, -0.114] A -0.351454
B -0.388151
C -0.404094
(-0.114, 0.587] A 0.134442
B 0.235705
C 0.406392
(0.587, 1.864] A 1.056471
B 0.973123
C 1.189502
Name: x, dtype: float64
Here is my env:
Package Version
---------------------- ----------
appnope 0.1.0
asn1crypto 1.2.0
attrs 19.3.0
backcall 0.1.0
bleach 3.1.5
certifi 2019.11.28
cffi 1.13.0
chardet 3.0.4
conda 4.8.2
conda-package-handling 1.6.0
cryptography 2.8
decorator 4.4.2
defusedxml 0.6.0
entrypoints 0.3
idna 2.8
importlib-metadata 1.6.0
ipykernel 5.2.1
ipython 7.14.0
ipython-genutils 0.2.0
ipywidgets 7.5.1
jedi 0.17.0
Jinja2 2.11.2
jsonschema 3.2.0
jupyter 1.0.0
jupyter-client 6.1.3
jupyter-console 6.1.0
jupyter-core 4.6.3
MarkupSafe 1.1.1
mistune 0.8.4
mkl-fft 1.0.15
mkl-random 1.1.0
mkl-service 2.3.0
nbconvert 5.6.1
nbformat 5.0.6
notebook 6.0.3
numpy 1.18.1
packaging 20.3
pandas 1.0.3
pandocfilters 1.4.2
parso 0.7.0
pexpect 4.8.0
pickleshare 0.7.5
pip 19.3.1
prometheus-client 0.7.1
prompt-toolkit 3.0.5
ptyprocess 0.6.0
pycosat 0.6.3
pycparser 2.19
Pygments 2.6.1
pyOpenSSL 19.0.0
pyparsing 2.4.7
pyq 4.2.1
pyrsistent 0.16.0
PySocks 1.7.1
python-dateutil 2.8.1
pytz 2020.1
pyzmq 19.0.1
qtconsole 4.7.4
QtPy 1.9.0
requests 2.22.0
ruamel-yaml 0.15.46
Send2Trash 1.5.0
setuptools 41.4.0
six 1.12.0
terminado 0.8.3
testpath 0.4.4
tornado 6.0.4
tqdm 4.36.1
traitlets 4.3.3
urllib3 1.24.2
wcwidth 0.1.9
webencodings 0.5.1
wheel 0.33.6
widgetsnbextension 3.5.1
zipp 3.1.0
This would need a regression test before it can be closed.
Do you mean writing a regression test for the correct behavior, like this one, to make sure the behavior stays correct?
pandas/pandas/tests/groupby/test_groupby.py
Lines 108 to 147 in dbc3afa
yes
Noted, can I take on this issue?
After experimenting in a notebook, it seems I can replicate the problem under these conditions:
- Using pandas 1.0.3 (no problem on 1.0.4)
- Grouping the dataframe on multiple keys (MultiIndex result)
- Using a column of type CategoricalDtype, not IntervalDtype
- Using observed=False as groupby's argument
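To make the role of observed concrete, here is a small sketch on current pandas with a plain categorical key and a second, non-categorical key. The column name bucket and the toy values are made up for illustration. With observed=False the result is reindexed to every combination of category and second-key value, which is the reindexing step that appears in the traceback (_reindex_output); with observed=True only combinations present in the data are returned.

```python
import pandas as pd

t = pd.DataFrame({
    "x": [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4],
    "w": ["A", "B", "A", "B", "A", "A", "A", "A"],
})
# Plain categorical key (hypothetical column name); no ("hi", "B") rows exist.
t["bucket"] = pd.Categorical(["lo"] * 4 + ["hi"] * 4, categories=["lo", "hi"])

# observed=False: result covers the full product of categories and w values.
full = t.groupby(["bucket", "w"], observed=False)["x"].mean()

# observed=True: only combinations that occur in the data.
seen = t.groupby(["bucket", "w"], observed=True)["x"].mean()

print(len(full), len(seen))
```

The unobserved ("hi", "B") pair shows up in full as NaN and is absent from seen.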
And it seems the current test suite already takes this into account with the following lines:
pandas/pandas/tests/groupby/test_categorical.py
Lines 317 to 353 in 361021b
I want to run this unit test against pandas 1.0.3. Do you have any recommendations for how to do so?
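One option that does not depend on the pandas test suite is a standalone script run inside an environment where the version of interest is installed. The sketch below is an assumption about how that could look, not the test from the pandas repository: the function name is hypothetical, and the expected values are computed by grouping on the integer category codes, which does not go through the categorical reindexing step. It tries to enable mode.use_inf_as_na as in the original report and tolerates pandas versions where that option is deprecated or absent.

```python
import warnings

import numpy as np
import pandas as pd


def check_groupby_categorical_and_column():
    """Hypothetical standalone check; not part of the pandas test suite."""
    # Try to enable the option used in the original report. Newer pandas
    # versions deprecate it or no longer define it, so that is tolerated.
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            pd.set_option("mode.use_inf_as_na", True)
        option_set = True
    except KeyError:
        option_set = False

    try:
        rng = np.random.default_rng(0)
        t = pd.DataFrame({
            "x": rng.standard_normal(100),
            "w": rng.choice(list("ABC"), 100),
        })
        qq = pd.qcut(t["x"], q=np.linspace(0, 1, 5))

        # Call from the report; expected to raise where the bug is present.
        result = t.groupby([qq, "w"], observed=False)["x"].agg("mean")

        # Reference values: group on the integer codes of the categorical.
        expected = t.groupby([qq.cat.codes, "w"])["x"].agg("mean")
        return bool(
            np.allclose(result.dropna().to_numpy(), expected.to_numpy())
        )
    finally:
        # Restore the option so the check leaves global state unchanged.
        if option_set:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                pd.reset_option("mode.use_inf_as_na")


print(pd.__version__, check_groupby_categorical_and_column())
```

Running the same file under each installed version gives a direct comparison between versions.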
I've added a test for this older bug which, as stated above, was already fixed a while back.
Looks like the closed PR #52818 should have closed this issue. Pinging @phofl for visibility.
Hello,
Could this task be closed?
Thank you,