pandas-dev/pandas

BUG: grouping with categorical interval columns

Closed this issue · 11 comments

Versions:
pandas 1.0.3
numpy 1.18.1

There is a bug in the pandas 1.0.x release line that prevents grouping by a categorical interval column together with another column.

import numpy as np
import pandas as pd
pd.set_option("use_inf_as_na",True)
t = pd.DataFrame({"x":np.random.randn(100), 'w':np.random.choice(list("ABC"), 100)})
qq = pd.qcut(t['x'], q=np.linspace(0,1,5))

This works and gives the expected result:
t.groupby([qq])['x'].agg('mean')

x
(-10.001, -1.0]   -1.431893
(-1.0, 0.0]       -0.423564
(0.0, 1.0]         0.461174
(1.0, 10.0]        1.662297
Name: x, dtype: float64

This raises a TypeError:
t.groupby([qq,'w'])['x'].agg('mean')

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-43-6d7782f17653> in <module>
----> 1 t.groupby([qq,'w'])['x'].agg('mean')

~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
    245 
    246         if isinstance(func, str):
--> 247             return getattr(self, func)(*args, **kwargs)
    248 
    249         elif isinstance(func, abc.Iterable):

~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in mean(self, *args, **kwargs)
   1223         nv.validate_groupby_func("mean", args, kwargs, ["numeric_only"])
   1224         return self._cython_agg_general(
-> 1225             "mean", alt=lambda x, axis: Series(x).mean(**kwargs), **kwargs
   1226         )
   1227 

~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
    907             raise DataError("No numeric types to aggregate")
    908 
--> 909         return self._wrap_aggregated_output(output)
    910 
    911     def _python_agg_general(self, func, *args, **kwargs):

~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _wrap_aggregated_output(self, output)
    384             output=output, index=self.grouper.result_index
    385         )
--> 386         return self._reindex_output(result)._convert(datetime=True)
    387 
    388     def _wrap_transformed_output(

~/miniconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _reindex_output(self, output, fill_value)
   2481         levels_list = [ping.group_index for ping in groupings]
   2482         index, _ = MultiIndex.from_product(
-> 2483             levels_list, names=self.grouper.names
   2484         ).sortlevel()
   2485 

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in from_product(cls, iterables, sortorder, names)
    551 
    552         codes = cartesian_product(codes)
--> 553         return MultiIndex(levels, codes, sortorder=sortorder, names=names)
    554 
    555     @classmethod

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in __new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity, _set_identity)
    278 
    279         if verify_integrity:
--> 280             new_codes = result._verify_integrity()
    281             result._codes = new_codes
    282 

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _verify_integrity(self, codes, levels)
    366 
    367         codes = [
--> 368             self._validate_codes(level, code) for level, code in zip(levels, codes)
    369         ]
    370         new_codes = FrozenList(codes)

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in <listcomp>(.0)
    366 
    367         codes = [
--> 368             self._validate_codes(level, code) for level, code in zip(levels, codes)
    369         ]
    370         new_codes = FrozenList(codes)

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _validate_codes(self, level, code)
    302         to a level with missing values (NaN, NaT, None).
    303         """
--> 304         null_mask = isna(level)
    305         if np.any(null_mask):
    306             code = np.where(null_mask[code], -1, code)

~/miniconda3/lib/python3.7/site-packages/pandas/core/dtypes/missing.py in isna(obj)
    124     Name: 1, dtype: bool
    125     """
--> 126     return _isna(obj)
    127 
    128 

~/miniconda3/lib/python3.7/site-packages/pandas/core/dtypes/missing.py in _isna_old(obj)
    181         return False
    182     elif isinstance(obj, (ABCSeries, np.ndarray, ABCIndexClass, ABCExtensionArray)):
--> 183         return _isna_ndarraylike_old(obj)
    184     elif isinstance(obj, ABCGeneric):
    185         return obj._constructor(obj._data.isna(func=_isna_old))

~/miniconda3/lib/python3.7/site-packages/pandas/core/dtypes/missing.py in _isna_ndarraylike_old(obj)
    281         else:
    282             result = np.empty(shape, dtype=bool)
--> 283             vec = libmissing.isnaobj_old(values.ravel())
    284             result[:] = vec.reshape(shape)
    285 

TypeError: Argument 'arr' has incorrect type (expected numpy.ndarray, got Categorical)
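Judging from the traceback (this is an assumption, not a confirmed diagnosis), the unobserved categorical levels are handed to `isna()` while the result is re-expanded into the full MultiIndex product, a code path only taken with `use_inf_as_na=True`. On affected versions, two things appear to sidestep it: leaving `use_inf_as_na` at its default, or passing `observed=True` so unobserved category combinations are never re-expanded. A minimal sketch of the latter:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = pd.DataFrame({"x": rng.standard_normal(100),
                  "w": rng.choice(list("ABC"), 100)})
qq = pd.qcut(t["x"], q=np.linspace(0, 1, 5))

# observed=True keeps only bin/letter combinations that actually occur,
# skipping the MultiIndex re-expansion step where the TypeError was raised
result = t.groupby([qq, "w"], observed=True)["x"].agg("mean")
print(result)
```

`observed=True` changes the output shape (missing combinations are dropped rather than filled with NaN), so it is a workaround only when that is acceptable.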

This seems to work on master for me:

[ins] In [1]: import numpy as np                                                                                                                                                                             

[ins] In [2]: import pandas as pd                                                                                                                                                                            

[ins] In [3]: pd.set_option("use_inf_as_na",True)                                                                                                                                                            

[ins] In [4]: t = pd.DataFrame({"x":np.random.randn(100), 'w':np.random.choice(list("ABC"), 100)})                                                                                                           

[ins] In [5]: qq = pd.qcut(t['x'], q=np.linspace(0,1,5))                                                                                                                                                     

[ins] In [6]: t.groupby([qq,'w'])['x'].agg('mean')                                                                                                                                                           
Out[6]: 
x                              w
(-2.7649999999999997, -0.736]  A   -1.412247
                               B   -1.319972
                               C   -1.108550
(-0.736, -0.114]               A   -0.351454
                               B   -0.388151
                               C   -0.404094
(-0.114, 0.587]                A    0.134442
                               B    0.235705
                               C    0.406392
(0.587, 1.864]                 A    1.056471
                               B    0.973123
                               C    1.189502
Name: x, dtype: float64

Here is my env:

Package                Version   
---------------------- ----------
appnope                0.1.0     
asn1crypto             1.2.0     
attrs                  19.3.0    
backcall               0.1.0     
bleach                 3.1.5     
certifi                2019.11.28
cffi                   1.13.0    
chardet                3.0.4     
conda                  4.8.2     
conda-package-handling 1.6.0     
cryptography           2.8       
decorator              4.4.2     
defusedxml             0.6.0     
entrypoints            0.3       
idna                   2.8       
importlib-metadata     1.6.0     
ipykernel              5.2.1     
ipython                7.14.0    
ipython-genutils       0.2.0     
ipywidgets             7.5.1     
jedi                   0.17.0    
Jinja2                 2.11.2    
jsonschema             3.2.0     
jupyter                1.0.0     
jupyter-client         6.1.3     
jupyter-console        6.1.0     
jupyter-core           4.6.3     
MarkupSafe             1.1.1     
mistune                0.8.4     
mkl-fft                1.0.15    
mkl-random             1.1.0     
mkl-service            2.3.0     
nbconvert              5.6.1     
nbformat               5.0.6     
notebook               6.0.3     
numpy                  1.18.1    
packaging              20.3      
pandas                 1.0.3     
pandocfilters          1.4.2     
parso                  0.7.0     
pexpect                4.8.0     
pickleshare            0.7.5     
pip                    19.3.1    
prometheus-client      0.7.1     
prompt-toolkit         3.0.5     
ptyprocess             0.6.0     
pycosat                0.6.3     
pycparser              2.19      
Pygments               2.6.1     
pyOpenSSL              19.0.0    
pyparsing              2.4.7     
pyq                    4.2.1     
pyrsistent             0.16.0    
PySocks                1.7.1     
python-dateutil        2.8.1     
pytz                   2020.1    
pyzmq                  19.0.1    
qtconsole              4.7.4     
QtPy                   1.9.0     
requests               2.22.0    
ruamel-yaml            0.15.46   
Send2Trash             1.5.0     
setuptools             41.4.0    
six                    1.12.0    
terminado              0.8.3     
testpath               0.4.4     
tornado                6.0.4     
tqdm                   4.36.1    
traitlets              4.3.3     
urllib3                1.24.2    
wcwidth                0.1.9     
webencodings           0.5.1     
wheel                  0.33.6    
widgetsnbextension     3.5.1     
zipp                   3.1.0     

Also works for me on 1.3.5
Should this issue be closed?

would take a regression test to close

Do you mean writing a regression test for the correct behavior, like this one, to make sure the behavior stays correct?

def test_groupby_return_type():
    # GH2893, return a reduced type
    df1 = DataFrame(
        [
            {"val1": 1, "val2": 20},
            {"val1": 1, "val2": 19},
            {"val1": 2, "val2": 27},
            {"val1": 2, "val2": 12},
        ]
    )

    def func(dataf):
        return dataf["val2"] - dataf["val2"].mean()

    with tm.assert_produces_warning(FutureWarning):
        result = df1.groupby("val1", squeeze=True).apply(func)
    assert isinstance(result, Series)

    df2 = DataFrame(
        [
            {"val1": 1, "val2": 20},
            {"val1": 1, "val2": 19},
            {"val1": 1, "val2": 27},
            {"val1": 1, "val2": 12},
        ]
    )

    def func(dataf):
        return dataf["val2"] - dataf["val2"].mean()

    with tm.assert_produces_warning(FutureWarning):
        result = df2.groupby("val1", squeeze=True).apply(func)
    assert isinstance(result, Series)

    # GH3596, return a consistent type (regression in 0.11 from 0.10.1)
    df = DataFrame([[1, 1], [1, 1]], columns=["X", "Y"])
    with tm.assert_produces_warning(FutureWarning):
        result = df.groupby("X", squeeze=False).count()
    assert isinstance(result, DataFrame)
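A regression test targeted at this specific report could look roughly like the sketch below. The test name is made up, and on versions where the option still exists the body would also run under `pd.option_context("use_inf_as_na", True)`; that is omitted here so the sketch runs on current pandas, where the option has been removed.

```python
import numpy as np
import pandas as pd


def test_groupby_categorical_interval_and_column():
    # deterministic data so the group means are known
    t = pd.DataFrame({"x": np.arange(8, dtype="float64"),
                      "w": list("AB") * 4})
    qq = pd.qcut(t["x"], q=2)  # two interval-valued categories

    # on pandas 1.0.3 this raised TypeError when use_inf_as_na was set
    result = t.groupby([qq, "w"], observed=False)["x"].agg("mean")

    assert result.index.nlevels == 2
    assert result.tolist() == [1.0, 2.0, 5.0, 6.0]
```

The assertions pin both the MultiIndex shape and the group means, so a regression back to the TypeError (or to a reordered result) would fail the test.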

yes

Noted, can I take on this issue?

After tweaking around in a notebook, it seems I could replicate the problem only under these conditions:

  • pandas 1.0.3 (no problem on 1.0.4)
  • multiple groupers, so the result has a MultiIndex
  • a grouping column of CategoricalDtype (not necessarily IntervalDtype)
  • observed=False passed as groupby's argument

And it seems the current test suite has already taken this into account with the following lines:
def test_observed(observed):
    # multiple groupers, don't re-expand the output space
    # of the grouper
    # gh-14942 (implement)
    # gh-10132 (back-compat)
    # gh-8138 (back-compat)
    # gh-8869
    cat1 = Categorical(["a", "a", "b", "b"], categories=["a", "b", "z"], ordered=True)
    cat2 = Categorical(["c", "d", "c", "d"], categories=["c", "d", "y"], ordered=True)
    df = DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
    df["C"] = ["foo", "bar"] * 2

    # multiple groupers with a non-cat
    gb = df.groupby(["A", "B", "C"], observed=observed)
    exp_index = MultiIndex.from_arrays(
        [cat1, cat2, ["foo", "bar"] * 2], names=["A", "B", "C"]
    )
    expected = DataFrame({"values": Series([1, 2, 3, 4], index=exp_index)}).sort_index()
    result = gb.sum()
    if not observed:
        expected = cartesian_product_for_groupers(
            expected, [cat1, cat2, ["foo", "bar"]], list("ABC"), fill_value=0
        )
    tm.assert_frame_equal(result, expected)

    gb = df.groupby(["A", "B"], observed=observed)
    exp_index = MultiIndex.from_arrays([cat1, cat2], names=["A", "B"])
    expected = DataFrame({"values": [1, 2, 3, 4]}, index=exp_index)
    result = gb.sum()
    if not observed:
        expected = cartesian_product_for_groupers(
            expected, [cat1, cat2], list("AB"), fill_value=0
        )
    tm.assert_frame_equal(result, expected)

I want to run this unit test against pandas version 1.0.3; do you have any recommendation on how to do so?
PrimeF commented

I've added a test for this older bug which, as stated above, was already fixed a while back.

Looks like PR #52818 (now closed) should have closed this issue. Pinging @phofl for visibility.

Hello,
Could this task be closed?
Thank you,