scikit-learn-contrib/sklearn-pandas

GridSearchCV extremely slow with DataFrameMapper?

andytwigg opened this issue · 13 comments

I have a dataframe, not particularly large (~3000 rows, 250 cols), on which I do the following:

df = ...
obj_cols = [(c, LabelBinarizer()) for c in df.columns if df.dtypes[c] == 'O']
num_cols = [(c, StandardScaler()) for c in df.columns if df.dtypes[c] != 'O']
param_grid = {
  'clf__loss': ['hinge', 'log', 'modified_huber'],
  'clf__penalty': ('l1', 'l2', 'elasticnet'),
}

pipeline = sklearn.pipeline.Pipeline([ 
  ('mapper', sklearn_pandas.DataFrameMapper(obj_cols+num_cols)),
  ('clf', SGDClassifier()),
])

grid_search = sklearn_pandas.GridSearchCV(pipeline, param_grid)
grid_search.fit(df[data], df[target]) # this is REALLY slow

From a quick glance, it seems to spend all its time indexing dataframe objects. The following two pieces of code, by contrast, are very fast:

for params in ParameterGrid(param_grid):
  pipeline.set_params(**params)
  X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(df[data], df[target])
  pipeline.fit(X_train, y_train)
  score = pipeline.score(X_test, y_test)

and:

mapper = sklearn_pandas.DataFrameMapper(obj_cols + num_cols)
y = df[target]
X = mapper.fit_transform(df[data], y)
pipeline = sklearn.pipeline.Pipeline([('clf', SGDClassifier())])
grid_search = sklearn.grid_search.GridSearchCV(pipeline, param_grid)
grid_search.fit(X, y)

So it must be something to do with using GridSearchCV with the DataFrameMapper. Any ideas?

More generally, is there a better way to handle categorical variables?

Could you please try to provide a code snippet that generates random data and exhibits the same behavior?

It would also be interesting to report the output of a profiler, for instance using the %prun magic command in an IPython session.
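For example:

%prun grid_search.fit(df[data], df[target])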

import numpy as np
import pandas as pd
import sklearn_pandas
import sklearn.pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn.linear_model import SGDClassifier


n = 1000
k = 100
cols = {str(c): np.random.randint(1000, size=n) for c in range(k)}
df = pd.DataFrame(cols)
df['target'] = np.random.randint(2, size=n)
data = [str(c) for c in range(k)]  # column labels are strings
target = 'target'

obj_cols = [(c, LabelBinarizer()) for c in df.columns if df.dtypes[c] == 'O' and c != target]
num_cols = [(c, StandardScaler()) for c in df.columns if df.dtypes[c] != 'O' and c != target]
param_grid = {
  'clf__loss': ['hinge', 'log', 'modified_huber'],
  'clf__penalty': ('l1', 'l2', 'elasticnet'),
}

pipeline = sklearn.pipeline.Pipeline([ 
  ('mapper', sklearn_pandas.DataFrameMapper(obj_cols+num_cols)),
  ('clf', SGDClassifier()),
])

grid_search = sklearn_pandas.GridSearchCV(pipeline, param_grid, verbose=2)
grid_search.fit(df[data], df[target]) # this is REALLY slow

From %prun:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      830    1.623    0.002  129.768    0.156 __init__.py:71(_get_col_subset)
   225830    1.588    0.000  104.399    0.000 series.py:489(__getitem__)
...
       28    0.009    0.000    0.011    0.000 {sklearn.linear_model.sgd_fast.plain_sgd}

Is this helpful? It seems that almost all the time is spent in _get_col_subset.

I'm seeing very similar behavior with sklearn_pandas.cross_val_score, I believe.

I've been investigating this and the culprits seem to be these lines:

Timer unit: 1e-06 s
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
90                                               @profile
91                                               def _get_col_subset(self, X, cols):
...
105                                           
106        45           27      0.6      0.0          if isinstance(X, list):
107       295       126293    428.1     70.4              X = [x[cols] for x in X]
108        45        48792   1084.3     27.2              X = pd.DataFrame(X)

Apparently the DataWrapper prevents sklearn's cross-validation functions from turning the dataframe into a numpy array before it reaches _get_col_subset. Since the DataWrapper instance doesn't have a shape attribute, sklearn.cross_validation._safe_split returns a list of Series, one per row taking part in the CV split (example). These Series are later grouped back into a dataframe inside the _get_col_subset method.
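To illustrate (a standalone sketch using a plain DataFrame rather than the actual DataWrapper), this is essentially what happens on every split:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 100))

# Without a shape attribute, _safe_split falls back to indexing the
# wrapped object row by row, yielding one pandas Series per row...
rows = [df.iloc[i] for i in range(800)]

# ...which _get_col_subset then has to stitch back together into a
# DataFrame, paying this construction cost for every column subset:
X = pd.DataFrame(rows)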

I'm not sure what the best way to deal with this is. Replacing the previous two lines with:

X = pd.DataFrame(X)

and leaving the cols slicing to the later code in the same function seems to provide a good speedup (around 3x), but I still have to write tests to ensure it doesn't break anything.
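For context, a rough sketch of the resulting hot path (hypothetical and heavily simplified; the real _get_col_subset handles more cases than this):

import pandas as pd

def _get_col_subset(self, X, cols):
    # Rebuild the DataFrame once from the list of row Series instead
    # of slicing cols out of each Series individually.
    if isinstance(X, list):
        X = pd.DataFrame(X)
    # The existing column-slicing code later in the function then
    # operates on the whole DataFrame in one go:
    return X[cols]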

Perhaps we can get better speedups without the list trick, but I don't know how to do that while still preventing sklearn from turning the dataframe into a numpy array.

Ideas welcome! :)

Hm, I was testing with scikit-learn==0.15.2. It looks like this might already be solved in scikit-learn>=0.16.0, since it uses the indexable function to check the input instead of check_arrays.

See #26 (comment) and https://github.com/scikit-learn/scikit-learn/blob/0.16.0/sklearn/cross_validation.py#L1350.
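If I'm reading the 0.16 code right, indexable only validates consistent lengths and indexing support, leaving DataFrames untouched:

import pandas as pd
from sklearn.utils import indexable

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
y = pd.Series([0, 1, 0])

# No coercion to numpy arrays: the inputs come back as-is.
X_out, y_out = indexable(df, y)
assert X_out is df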

Perhaps we should just note in the documentation that the custom cv-wrappers are only needed for scikit-learn<0.16.0 and leave it at that. What do you think @zacstewart?

I think documenting that is a good idea, but maybe we should also pass sklearn_pandas.GridSearchCV through to sklearn itself depending on the version. Is something like this worth uglying up the code to make it future-friendly?

import sklearn.grid_search
from distutils.version import StrictVersion

# StrictVersion('0.16.0') compares equal to StrictVersion('0.16'),
# so use >= to include the 0.16.0 release itself.
if StrictVersion(sklearn.__version__) >= StrictVersion('0.16'):
  sklearn_pandas.GridSearchCV = sklearn.grid_search.GridSearchCV

I don't think it's worth uglying up the code that way. We can say that these wrappers are deprecated and will eventually be dropped in sklearn-pandas 2.0. I will, however, add a test to make sure that sklearn.grid_search.GridSearchCV in scikit-learn>=0.16.0 works with a DataFrameMapper in a pipeline.
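Something along these lines (a sketch of the test with made-up toy data; the list-of-columns form is used so StandardScaler receives a 2-D array):

import numpy as np
import pandas as pd
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper

def test_grid_search_with_dataframe_mapper():
    df = pd.DataFrame({'a': np.random.randn(100),
                       'b': np.random.randn(100)})
    y = np.random.randint(2, size=100)
    pipeline = Pipeline([
        ('mapper', DataFrameMapper([(['a'], StandardScaler()),
                                    (['b'], StandardScaler())])),
        ('clf', SGDClassifier()),
    ])
    # The stock GridSearchCV should now accept the DataFrame directly,
    # without the sklearn-pandas wrapper:
    grid = GridSearchCV(pipeline, {'clf__penalty': ['l1', 'l2']})
    grid.fit(df, y)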

@zacstewart can you review #48 please? It's a really minor addition but I always like the four-eyes approach to changes. :-)

Along those lines: unfortunately, CalibratedClassifierCV, introduced in sklearn 0.16, does not seem to work with DataFrameMappers in a pipeline (this is still the case in sklearn 0.17).

@Balandat Could you provide an example with a traceback (or wrong result)? Thanks.

@Balandat I'm closing this issue since it's already fixed. I've opened #53 to follow up on the issue you mention.