[BUG] Stacking Regressor does not accept GridSearchCV objects
Closed this issue · 4 comments
Describe the bug
I tried out the stacking regressor using the new julearn_sk_pandas API.
From what I can gather, this is not a julearn-specific bug but a quirk in the scikit-learn API. While the StackingClassifier allows hyperparameter tuning of individual models using a GridSearchCV object, the StackingRegressor does not seem to allow this. It is also possible that something else is wrong in my configuration.
To Reproduce
Steps to reproduce the behavior:
I set up julearn on the new julearn_sk_pandas branch by running:
#!/usr/bin/bash
conda create -n example_venv python=3.9.15 seaborn
eval "$(conda shell.bash hook)"
conda activate example_venv
# check python version as expected
if [ "$(python3 --version)" != "Python 3.9.15" ]; then
    echo "python version $(python3 --version) is not correct, should be 3.9.15"
    exit 1
else
    echo "$(python3 --version)"
fi
# install julearn
git clone https://github.com/juaml/julearn.git
cd julearn
git checkout julearn_sk_pandas
# this branch is experimental;
# replace it with the next stable release that ships the new API
pip install -e .
cd ..
I then ran julearn's run_cross_validation with a stacked model built with the PipelineCreator, as follows:
from sklearn.datasets import make_regression
import pandas as pd
from julearn.api import run_cross_validation
from julearn.pipeline import PipelineCreator
from julearn.utils import configure_logging
configure_logging(level="INFO")
# prepare data
X, y = make_regression(n_features=10, n_samples=200)
# prepare feature names and types
X_types = {
    "type1": [f"type1_{x}" for x in range(1, 6)],
    "type2": [f"type2_{x}" for x in range(1, 6)],
}
X_names = X_types["type1"] + X_types["type2"]
# make df
data = pd.DataFrame(X)
data.columns = X_names
data["target"] = y
# create individual models
model_1 = PipelineCreator(problem_type="regression", apply_to="type1")
model_1.add("filter_columns", apply_to="*", keep="type1")
model_1.add("svm", C=[1, 2])
model_2 = PipelineCreator(problem_type="regression", apply_to="type2")
model_2.add("filter_columns", apply_to="*", keep="type2")
model_2.add("rf")
# Create the stacking model
model = PipelineCreator(problem_type="regression")
model.add(
    "stacking",
    estimators=[[("model_1", model_1), ("model_2", model_2)]],
    apply_to="*",
)
# run
scores, final = run_cross_validation(
    X=X_names,
    y="target",
    data=data,
    model=model,
    seed=200,
)
This gives me the following error:
Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/pipeline.py", line 406, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "/home/lsasse/julearn_issues/julearn/julearn/base/estimators.py", line 77, in fit
self.model_.fit(Xt, y, **fit_params)
File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/ensemble/_stacking.py", line 957, in fit
return super().fit(X, y, sample_weight)
File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/ensemble/_stacking.py", line 195, in fit
names, all_estimators = self._validate_estimators()
File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/ensemble/_base.py", line 303, in _validate_estimators
raise ValueError(
ValueError: The estimator GridSearchCV should be a regressor.
I tried the identical example as a classification problem, based on the example provided in the new branch, and it works fine (including tuning the C parameter of the SVM).
Additional context
See https://stackoverflow.com/questions/69269334/hyperparameter-tuning-for-stackingregressor-sklearn. This suggests to me that julearn needs to wrap hyperparameters differently in the case of stacking: define the models inside the stacking regressor, then put the stacking regressor into a GridSearchCV whose param grid defines the hyperparameters for each model in the stack.
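For reference, here is a minimal plain scikit-learn sketch of that approach, without julearn. The estimator names "svr" and "rf", the reduced number of trees, and cv=3 are my own choices for illustration and are not taken from julearn. The nested parameter name follows scikit-learn's usual `<estimator name>__<parameter>` convention.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Same kind of toy data as in the report (random_state added for repeatability).
X, y = make_regression(n_features=10, n_samples=200, random_state=200)

# Define the base models directly inside the stacking regressor.
# The names "svr" and "rf" are arbitrary labels chosen for this sketch.
stack = StackingRegressor(
    estimators=[
        ("svr", SVR()),
        ("rf", RandomForestRegressor(n_estimators=20, random_state=200)),
    ],
)

# Tune the SVR inside the stack from the outside: "svr__C" addresses
# the C parameter of the estimator registered under the name "svr".
param_grid = {"svr__C": [1, 2]}

# Wrap the whole stack in a single grid search.
search = GridSearchCV(stack, param_grid=param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```

The trade-off is that every grid point refits the whole stack, whereas a GridSearchCV nested inside each base model tunes that model on its own.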
What's more, even if I remove the hyperparameter tuning, the StackingRegressor does not seem to allow a scikit-learn pipeline as an estimator either:
from sklearn.datasets import make_regression
import pandas as pd
from julearn.api import run_cross_validation
from julearn.pipeline import PipelineCreator
from julearn.utils import configure_logging
configure_logging(level="INFO")
# prepare data
X, y = make_regression(n_features=10, n_samples=200)
# prepare feature names and types
X_types = {
    "type1": [f"type1_{x}" for x in range(1, 6)],
    "type2": [f"type2_{x}" for x in range(1, 6)],
}
X_names = X_types["type1"] + X_types["type2"]
# make df
data = pd.DataFrame(X)
data.columns = X_names
data["target"] = y
# create individual models
model_1 = PipelineCreator(problem_type="regression", apply_to="type1")
model_1.add("filter_columns", apply_to="*", keep="type1")
model_1.add("svm")
model_2 = PipelineCreator(problem_type="regression", apply_to="type2")
model_2.add("filter_columns", apply_to="*", keep="type2")
model_2.add("rf")
# Create the stacking model
model = PipelineCreator(problem_type="regression")
model.add(
    "stacking",
    estimators=[[("model_1", model_1), ("model_2", model_2)]],
    apply_to="*",
)
# run
scores, final = run_cross_validation(
    X=X_names,
    y="target",
    data=data,
    model=model,
    seed=200,
)
which gives the error:
Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/pipeline.py", line 406, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "/home/lsasse/julearn_issues/julearn/julearn/base/estimators.py", line 77, in fit
self.model_.fit(Xt, y, **fit_params)
File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/ensemble/_stacking.py", line 957, in fit
return super().fit(X, y, sample_weight)
File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/ensemble/_stacking.py", line 195, in fit
names, all_estimators = self._validate_estimators()
File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/ensemble/_base.py", line 303, in _validate_estimators
raise ValueError(
ValueError: The estimator Pipeline should be a regressor.
This is the check sklearn is using: https://github.com/scikit-learn/scikit-learn/blob/a576bcc22f0a22d0e9db1e529b21d7e1266f96ca/sklearn/ensemble/_base.py#L299
It relies on this is_regressor check: https://github.com/scikit-learn/scikit-learn/blob/a576bcc22f0a22d0e9db1e529b21d7e1266f96ca/sklearn/base.py#L1009
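To see how that check behaves, here is a small sketch using only scikit-learn. `BareWrapper` is a hypothetical stand-in that I made up for a wrapper that does not declare itself a regressor. It is not julearn's actual class, and whether this is the real cause of the error above is an assumption on my side.

```python
from sklearn.base import BaseEstimator, clone, is_regressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


class BareWrapper(BaseEstimator):
    """Hypothetical wrapper that forwards to a model but is not marked as a regressor."""

    def __init__(self, model=None):
        self.model = model

    def fit(self, X, y=None):
        # Fit a copy of the wrapped model and keep it for prediction.
        self.model_ = clone(self.model).fit(X, y)
        return self

    def predict(self, X):
        return self.model_.predict(X)


# A pipeline and a grid search report the type of the estimator they wrap.
plain_pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR())])
plain_search = GridSearchCV(plain_pipe, {"svr__C": [1, 2]})

# The same objects built around the unmarked wrapper.
wrapped_pipe = Pipeline([("scale", StandardScaler()), ("model", BareWrapper(SVR()))])
wrapped_search = GridSearchCV(wrapped_pipe, {"model__model__C": [1, 2]})

for name, est in [
    ("SVR", SVR()),
    ("Pipeline(SVR)", plain_pipe),
    ("GridSearchCV(Pipeline(SVR))", plain_search),
    ("Pipeline(BareWrapper)", wrapped_pipe),
    ("GridSearchCV(Pipeline(BareWrapper))", wrapped_search),
]:
    print(f"{name}: is_regressor={is_regressor(est)}")
```

If this is what happens here, the Pipeline and GridSearchCV themselves are accepted by StackingRegressor as long as the innermost estimator is recognised as a regressor, so the fix would belong in the wrapper rather than in the stacking setup.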
@LeSasse: excellent bug report!!
We just fixed this. This code should work with the new PR.
from sklearn.datasets import make_regression
import pandas as pd
from julearn.api import run_cross_validation
from julearn.pipeline import PipelineCreator
from julearn.utils import configure_logging
configure_logging(level="INFO")
# prepare data
X, y = make_regression(n_features=10, n_samples=200)
# prepare feature names and types
X_types = {
    "type1": [f"type1_{x}" for x in range(1, 6)],
    "type2": [f"type2_{x}" for x in range(1, 6)],
}
X_names = X_types["type1"] + X_types["type2"]
# make df
data = pd.DataFrame(X)
data.columns = X_names
data["target"] = y
# create individual models
model_1 = PipelineCreator(problem_type="regression", apply_to="type1")
model_1.add("filter_columns", apply_to="*", keep="type1")
model_1.add("svm", C=[1, 2])
model_2 = PipelineCreator(problem_type="regression", apply_to="type2")
model_2.add("filter_columns", apply_to="*", keep="type2")
model_2.add("rf")
# Create the stacking model
model = PipelineCreator(problem_type="regression")
model.add(
    "stacking",
    estimators=[[("model_1", model_1), ("model_2", model_2)]],
    apply_to="*",
)
# run
scores, final = run_cross_validation(
    X=X_names,
    X_types=X_types,
    y="target",
    data=data,
    model=model,
    seed=200,
    return_estimator="final",
)