GridSearchCV extremely slow with DataFrameMapper?
andytwigg opened this issue · 13 comments
I have a dataframe, not particularly large (~3000 rows, 250 cols) on which I do the following:
df = ...
obj_cols = [(c, LabelBinarizer()) for c in df.columns if df.dtypes[c] == 'O']
num_cols = [(c, StandardScaler()) for c in df.columns if df.dtypes[c] != 'O']
param_grid = {
    'clf__loss': ['hinge', 'log', 'modified_huber'],
    'clf__penalty': ('l1', 'l2', 'elasticnet'),
}
pipeline = sklearn.pipeline.Pipeline([
    ('mapper', sklearn_pandas.DataFrameMapper(obj_cols + num_cols)),
    ('clf', SGDClassifier()),
])
grid_search = sklearn_pandas.GridSearchCV(pipeline, param_grid)
grid_search.fit(df[data], df[target]) # this is REALLY slow
From a quick glance, it seems to spend all its time indexing dataframe objects. The following 2 pieces of code are very fast:
# (1) manual loop over the parameter grid
for params in ParameterGrid(param_grid):
    pipeline.set_params(**params)
    X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(df[data], df[target])
    pipeline.fit(X_train, y_train)
    score = pipeline.score(X_test, y_test)
# (2) mapping the dataframe once up front, then grid-searching on the array
X = mapper.fit_transform(df[data], df[target])
pipeline = Pipeline([('clf', SGDClassifier())])
grid_search = sklearn.grid_search.GridSearchCV(pipeline, param_grid)
grid_search.fit(X, df[target])
So it must be something to do with using GridSearchCV with the DataFrameMapper. Any ideas?
More generally, is there a better way to handle categorical variables?
Could you please try to provide a code snippet that generates random data and exhibits the same behavior?
It would also be interesting to report the output of a profiler, for instance using the %prun
magic command in an IPython session.
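In IPython that would be something like the following (assuming the grid_search object from the snippet above):
# profile the slow call and sort the output by cumulative time
%prun -s cumulative grid_search.fit(df[data], df[target])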
import numpy as np
import pandas as pd
import random
import sklearn_pandas
import sklearn.pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn.linear_model import SGDClassifier
n = 1000
k = 100
cols = dict([(str(c),np.random.randint(1000, size=n)) for c in range(k)])
df = pd.DataFrame(cols)
df['target'] = np.random.randint(2, size=n)
data = [str(c) for c in range(k)]  # the column labels are strings
target = 'target'
obj_cols = [(c, LabelBinarizer()) for c in df.columns if df.dtypes[c] == 'O' and c != target]
num_cols = [(c, StandardScaler()) for c in df.columns if df.dtypes[c] != 'O' and c != target]
param_grid = {
    'clf__loss': ['hinge', 'log', 'modified_huber'],
    'clf__penalty': ('l1', 'l2', 'elasticnet'),
}
pipeline = sklearn.pipeline.Pipeline([
    ('mapper', sklearn_pandas.DataFrameMapper(obj_cols + num_cols)),
    ('clf', SGDClassifier()),
])
grid_search = sklearn_pandas.GridSearchCV(pipeline, param_grid, verbose=2)
grid_search.fit(df[data], df[target]) # this is REALLY slow
From %prun:
ncalls tottime percall cumtime percall filename:lineno(function)
830 1.623 0.002 129.768 0.156 __init__.py:71(_get_col_subset)
225830 1.588 0.000 104.399 0.000 series.py:489(__getitem__)
...
28 0.009 0.000 0.011 0.000 {sklearn.linear_model.sgd_fast.plain_sgd}
Is this helpful? It seems that almost all time is spent in _get_col_subset.
I'm seeing very similar behavior with sklearn_pandas.cross_val_score, I believe.
I've been investigating this and the culprits seem to be these lines:
Timer unit: 1e-06 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
90 @profile
91 def _get_col_subset(self, X, cols):
...
105
106 45 27 0.6 0.0 if isinstance(X, list):
107 295 126293 428.1 70.4 X = [x[cols] for x in X]
108 45 48792 1084.3 27.2 X = pd.DataFrame(X)
Apparently the DataWrapper prevents sklearn's cross-validation functions from turning the dataframe into a numpy array before it reaches _get_col_subset. Since the DataWrapper instance doesn't have a shape attribute, sklearn.cross_validation._safe_split returns a list of Series, one per row (example) taking part in the CV split. These Series are later grouped back into a dataframe inside the _get_col_subset method.
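For context, the relevant branch in scikit-learn 0.15 looks roughly like this (simplified sketch of sklearn.cross_validation._safe_split, not the verbatim source):
# Simplified sketch: without a .shape attribute, X is indexed row by row,
# so the wrapped dataframe comes back as a plain list of pandas Series.
if not hasattr(X, 'shape'):
    X_subset = [X[idx] for idx in indices]   # one Series per row
else:
    X_subset = X[safe_mask(X, indices)]      # vectorized fancy indexing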
I'm not sure what the best way to deal with this is. Replacing the previous two lines with:
X = pd.DataFrame(X)
and leaving the cols slicing to the later code in the same function seems to provide a good speedup (around 3x), but I still have to write tests to ensure it doesn't break anything.
Perhaps we can get better speedups without the list trick, but I don't know how to do that while still preventing sklearn from turning the dataframe into a numpy array.
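For concreteness, the modified branch would look something like this (a sketch of the proposed change only, with the rest of _get_col_subset left unchanged):
# Sketch: rebuild the full DataFrame from the list of row Series in one
# step, instead of slicing `cols` out of every Series and concatenating.
if isinstance(X, list):
    X = pd.DataFrame(X)
# ...the existing code below then selects `cols` from the whole frame once.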
Ideas welcome! :)
Hm, I was testing it with scikit-learn==0.15.2. It looks like this might already be solved in scikit-learn>=0.16.0, since it uses an indexable function to check the input instead of check_arrays.
See #26 (comment) and https://github.com/scikit-learn/scikit-learn/blob/0.16.0/sklearn/cross_validation.py#L1350.
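For instance, under scikit-learn>=0.16.0 the following should work without any sklearn-pandas wrapper (untested sketch, reusing pipeline, param_grid and df from the repro snippet above):
import sklearn.grid_search
# The stock grid search accepts the DataFrame as-is, since indexable()
# no longer coerces the input into a numpy array up front.
grid_search = sklearn.grid_search.GridSearchCV(pipeline, param_grid, verbose=2)
grid_search.fit(df[data], df[target])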
Perhaps we should just write in the documentation that the custom cv-wrappers are only needed for scikit-learn<0.16.0 and leave it at that. What do you think @zacstewart ?
I think documenting that is a good idea, but maybe also have sklearn_pandas.GridSearchCV pass through to sklearn itself depending on the version. Is something like this worth uglying up the code to make it future-friendly?
import sklearn
import sklearn.grid_search
from distutils.version import StrictVersion
if StrictVersion(sklearn.__version__) >= StrictVersion('0.16'):
    sklearn_pandas.GridSearchCV = sklearn.grid_search.GridSearchCV
I don't think it's worth uglying up the code that way. We can say that these wrappers are deprecated and will eventually be dropped in sklearn-pandas 2.0. I will, however, make sure in a test that sklearn.grid_search.GridSearchCV in scikit-learn>=0.16.0 works with a DataFrameMapper in a pipeline.
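Something along these lines, perhaps (hypothetical test name, reusing the repro pipeline and dataframe from above):
def test_grid_search_with_dataframe_mapper():
    # The stock GridSearchCV on scikit-learn>=0.16.0 should fit a
    # pipeline whose first step is a DataFrameMapper, with no wrapper.
    grid_search = sklearn.grid_search.GridSearchCV(pipeline, param_grid)
    grid_search.fit(df[data], df[target])
    assert hasattr(grid_search, 'best_params_')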
@zacstewart can you review #48 please? It's a really minor addition but I always like the four-eyes approach to changes. :-)
Along those lines: unfortunately the class CalibratedClassifierCV introduced in scikit-learn 0.16 does not seem to work with a DataFrameMapper in a pipeline (this is still the case in scikit-learn 0.17).
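For reference, the failing combination looks roughly like this (sketch only, reusing the repro pipeline from above):
from sklearn.calibration import CalibratedClassifierCV
# Wrapping the DataFrameMapper pipeline in CalibratedClassifierCV and
# fitting on a DataFrame fails as reported, on both 0.16 and 0.17.
calibrated = CalibratedClassifierCV(pipeline, method='sigmoid', cv=3)
calibrated.fit(df[data], df[target])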