dodger487/dplython

Slowness when grouping on a large number of keys

Closed this issue · 5 comments

csaid commented

Hi, you may already know about this, but it's something important to have on your radar. When grouping on a large number of keys, things can get very slow. I had to switch back to regular pandas when an operation was taking more than 10 minutes.

# Grouping variable with 5 values -> Get results immediately
diamonds >> group_by(X.cut) >> mutate(m = X.x.mean())

# Grouping variable with 273 values -> Get results after 10 seconds. 
# For larger data frames, can take more than 10 minutes
diamonds >> group_by(X.carat) >> mutate(m = X.x.mean())

# The same operation in standard pandas happens instantaneously
diamonds.groupby('carat').mean()
csaid commented

I guess since I was giving a mutate() example rather than a summarize() example, the standard pandas comparison should be this instead, which still happens instantaneously:

diamonds.assign(m=lambda x: x.groupby('carat').x.transform('mean'))

Thanks for bringing this up. dplython's group_by does not currently use the built-in pandas grouping. I think it should; that would give dplython code all of pandas' great speed advantages.

Technically, this should be doable using either the pandas transform or apply methods. My concern is that if I have to use transform, I may need to refactor the Later object. At any rate, I think this is doable and it sounds like an important priority.
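
For concreteness, here is a rough pandas-only sketch of the two mechanisms being weighed. The grouped_mutate_* helpers are hypothetical illustrations, not dplython's implementation; they reuse the diamonds data from above.

import pandas as pd

def grouped_mutate_apply(df, group_col, **new_cols):
    # apply hands each group's full sub-DataFrame to the function, so
    # expressions can mix columns, but it tends to be slower.
    def per_group(d):
        for name, fn in new_cols.items():
            d = d.assign(**{name: fn(d)})
        return d
    return df.groupby(group_col, group_keys=False).apply(per_group)

def grouped_mutate_transform(df, group_col, name, col, agg):
    # transform works per-Series and broadcasts the group aggregate back
    # to every row; fast, but limited to single-column expressions.
    return df.assign(**{name: df.groupby(group_col)[col].transform(agg)})

# Usage, assuming the diamonds DataFrame from the examples above:
# grouped_mutate_apply(diamonds, "carat", m=lambda d: d.x.mean())
# grouped_mutate_transform(diamonds, "carat", "m", "x", "mean")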

#18 has helped with this issue, though it's not done yet.

Previously:

In [3]: %timeit diamonds >> group_by(X.carat) >> mutate(foo=X.x.mean())
"""
dplython/dplython.py:393: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df[key] = val.applyFcns(df)
1 loops, best of 3: 8.72 s per loop
"""

Now:

In [46]: %timeit diamonds >> group_by(X.carat) >> mutate(foo=X.x.mean())
"""
1 loops, best of 3: 346 ms per loop
"""

So 0.346 seconds vs. 8.72 seconds... that's a bit of a performance gain! But there's still a ways to go:

In [49]: %timeit diamonds.assign(foo=diamonds.groupby("carat").x.transform("mean"))
"""
100 loops, best of 3: 3.92 ms per loop
"""

In [50]: %timeit diamonds.assign(foo=diamonds.groupby("carat").x.transform(lambda n: n.mean()))
"""
10 loops, best of 3: 39.7 ms per loop
"""

I also tried running this experimentally with a dataset of ~3 million rows. Grouping into 550k groups and running a mutate still took around 2 minutes, unfortunately. But it's way faster than before.

Now, pandas grouping happens in the backend, and dplython is a little smarter about when it needs to make a copy of the dataframe. Currently we're using apply rather than transform in the backend, and apply seems to be much slower: even rewriting your code directly in terms of pandas apply still runs pretty slowly. I'm still trying to figure out a way to use transform instead. The main difficulty is that transform runs on each Series individually rather than on the DataFrame as a whole, which makes it hard to write things like X.x + X.y. I may end up needing to touch Later, as I said earlier, which will probably speed everything up but will take a bit more engineering effort.
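
To illustrate the apply/transform distinction described above, here is a small pandas-only sketch (not dplython code), reusing the diamonds data from the earlier examples:

g = diamonds.groupby("carat")

# apply receives each group's whole sub-DataFrame, so cross-column
# expressions like X.x + X.y translate directly, but it runs slowly:
slow = g.apply(lambda d: (d.x + d.y).mean())

# transform is invoked once per column with only that Series, so the fast
# path can't reference a second column at the same time:
fast = g["x"].transform("mean")                             # single column: fine
centered = g[["x", "y"]].transform(lambda s: s - s.mean())  # s is one Series
# There is no direct way to compute something like x + y inside that lambda.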

More progress on this with #63:

# PREVIOUSLY
In [46]: %timeit diamonds >> group_by(X.carat) >> mutate(foo=X.x.mean())
"""
1 loops, best of 3: 346 ms per loop
"""

# NOW
In [6]: %timeit diamonds >> group_by(X.carat) >> mutate(foo=X.x.mean())
"""
10 loops, best of 3: 64.5 ms per loop
"""

With #74, dplython should now be comparable in speed to pandas when grouping on a large number of keys.

from dplython import *
diamonds = diamonds >> mutate(bin=X["Unnamed: 0"] % 5000)
gbinp = diamonds.groupby("bin")
gbind = diamonds >> group_by(X.bin)

%timeit gbind >> summarize(foo=X.x.median() + X.price.sum())
%timeit gbinp.x.agg(lambda x: x.median()) + gbinp.price.agg(lambda x: x.sum())

1 loops, best of 3: 1.29 s per loop  # dplython speed
1 loops, best of 3: 1.19 s per loop  # pandas speed

There are still some areas where dplython can be sped up, but this specific issue of large numbers of groups seems to be solved. If anyone finds a case that's bad, please comment here or open a new issue.

The next big speed issue is that dplython makes a number of unnecessary copies of the DataFrame.
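
As a rough way to see why that matters, copying a large DataFrame can be comparable to, or more expensive than, the per-column arithmetic it wraps. A quick, illustrative check (not dplython code; numbers will vary by machine):

import numpy as np
import pandas as pd

# Synthetic frame roughly the size of the ~3-million-row experiment above.
df = pd.DataFrame(np.random.randn(3 * 10**6, 4), columns=list("abcd"))

%timeit df.copy()         # a full copy of the underlying data
%timeit df["a"].mean()    # the kind of per-column work a mutate actually needs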