pandas-dev/pandas

groupby produces 'minlength must be positive' error when applied to empty DataFrame

Sereger13 opened this issue · 8 comments

This used to work fine in previous versions but appears to be broken in 0.17.1

The following code:

import pandas as pd
df = pd.DataFrame({'A': [], 'B': []})
gb = df.groupby('A').size()

Produces this error:

ValueError: minlength must be positive

In v0.16.2 the same code produced an empty DataFrame. We'd really like to upgrade to 0.17.1, but we rely heavily on this functionality, so we have to hold off on the upgrade. Checking for an empty DataFrame isn't going to work for us either, as there are too many places where the frame can actually be empty.

If you can suggest any workaround in the meantime so we could upgrade, that would be appreciated.

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-238.9.1.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US

pandas: 0.16.2
...

cc @behzadnouri

@Sereger13 I don't think there is an easy way around this w/o resorting to patching DataFrame.groupby to catch this situation (which, while messy and normally not recommended, may work for you temporarily).

I see...

We found that this code:
count().iloc[:, 0]
produces very similar results to size() and seems to be working for us, but it doesn't look particularly attractive, so we're still deciding whether to keep it.
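For reference, a minimal sketch of that workaround on the toy frame from the report (the variable name is just illustrative):

import pandas as pd

df = pd.DataFrame({'A': [], 'B': []})

# count() per group, then take the first column to get a size()-like result;
# note that count() skips NaN values, so it differs from size() whenever
# the counted column has missing data
sizes = df.groupby('A').count().iloc[:, 0]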

If you do decide to fix size(), is there any idea when the next version/patch will be available? Thanks.

will be fixed; 0.18.0, probably late January

Thanks.

@Sereger13 my point about patching is that you can avoid any code changes.

Note again that this is a 'hack', but it will work.

e.g.

In [109]: df1 = pd.DataFrame({'A': [], 'B': []})

In [110]: df2 = pd.DataFrame({'A': [1,2,1], 'B': [1,2,3]})

In [116]: def size(self):
   .....:     try:
   .....:         return self.grouper.size()
   .....:     except ValueError:
   .....:         self._set_selection_from_grouper()
   .....:         return self._selected_obj[0:0]
   .....:     

In [117]: pandas.core.groupby.GroupBy.size = size

In [118]: df1.groupby('A').size()
Out[118]: 
Empty DataFrame
Columns: [B]
Index: []

In [119]: df2.groupby('A').size()
Out[119]: 
A
1    2
2    1
dtype: int64
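
If you want this in a module rather than an interactive session, one way to package the same hack is sketched below; it's written against the 0.17.x internals used above (_set_selection_from_grouper, _selected_obj) and keeps a handle on the original method so the patch can be undone:

import pandas.core.groupby as groupby_mod

# keep the original implementation around so it can be restored later
_original_size = groupby_mod.GroupBy.size

def _patched_size(self):
    try:
        return _original_size(self)
    except ValueError:
        # empty frame: fall back to an empty selection instead of raising
        self._set_selection_from_grouper()
        return self._selected_obj[0:0]

groupby_mod.GroupBy.size = _patched_size

Restoring the original behaviour later is then just groupby_mod.GroupBy.size = _original_size.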

Great - thanks for your help.

This is more a bug in np.bincount, because it unnecessarily requires minlength to be strictly positive. Though kind of ugly, the work-around would be simple:

diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index e9aa906..d722ef8 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -1439,7 +1439,8 @@ class BaseGrouper(object):
         """
         ids, _, ngroup = self.group_info
         ids = com._ensure_platform_int(ids)
-        out = np.bincount(ids[ids != -1], minlength=ngroup)
+        mask = ids != -1
+        out = np.bincount(ids[mask], minlength=ngroup) if ngroup != 0 else []
         return Series(out, index=self.result_index, dtype='int64')

     @cache_readonly
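
To see the underlying np.bincount behaviour directly (a quick check; NumPy releases of that era reject minlength=0, while newer NumPy accepts it and returns an empty array):

import numpy as np

ids = np.array([], dtype=np.intp)  # an empty frame produces no group ids

try:
    counts = np.bincount(ids, minlength=0)
except ValueError as exc:
    print(exc)     # older NumPy: "minlength must be positive"
else:
    print(counts)  # newer NumPy: an empty integer array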

Interesting... thanks for the update. Yes, they could have made np.bincount() better indeed; allowing either None or 0 with the same meaning would make it more usable.

So it looks like simply setting ngroup to None should also do the trick:

if not ngroup:
    ngroup = None
out = np.bincount(ids[ids != -1], minlength=ngroup)

Not sure this is more readable than @behzadnouri's solution, though. Looking forward to a new pandas release with the fix!