juaml/julearn

[BUG] Stacking Regressor does not accept GridSearchCV objects

Closed this issue · 4 comments

Describe the bug

I tried out the stacking regressor using the new API on the julearn_sk_pandas branch.

From what I can gather, this does not seem to be a julearn-specific bug but rather a quirk in the scikit-learn API: while the StackingClassifier allows hyperparameter tuning of the individual models via a GridSearchCV object, the StackingRegressor does not seem to allow this. Then again, maybe there is something else wrong in my configuration.
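For reference, here is a minimal sketch of the check that appears to fail (judging from the traceback further down, it is _validate_estimators in sklearn/ensemble/_base.py): every base estimator of a StackingRegressor must pass sklearn.base.is_regressor, i.e. expose _estimator_type == "regressor". The SVR here is purely illustrative.

from sklearn.base import is_regressor
from sklearn.svm import SVR

# a plain regressor passes the validation used by the stacking ensembles
print(is_regressor(SVR()))  # True

# the wrapped base models that julearn builds (GridSearchCV / Pipeline
# objects) apparently do not report themselves as regressors, which is
# what triggers the ValueError shown below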

To Reproduce
Steps to reproduce the behavior:

I set up julearn on the new julearn_sk_pandas branch by running:

#!/usr/bin/bash

conda create -n example_venv python=3.9.15 seaborn 
eval "$(conda shell.bash hook)"
conda activate example_venv

# check that the python version is as expected
if [ "$(python3 --version)" != "Python 3.9.15" ]; then
  echo "python version $(python3 --version) is not correct, should be 3.9.15"
  exit 1
else
  echo "$(python3 --version)"
fi

# install julearn
git clone https://github.com/juaml/julearn.git
cd julearn
git checkout julearn_sk_pandas
# this branch is experimental,
# replace with the next stable release with the new API
pip install -e .
cd ..       

I then ran julearn's run_cross_validation with a stacked model built using the PipelineCreator, as follows:

from sklearn.datasets import make_regression
import pandas as pd

from julearn.api import run_cross_validation
from julearn.pipeline import PipelineCreator
from julearn.utils import configure_logging

configure_logging(level="INFO")

# prepare data
X, y = make_regression(n_features=10, n_samples=200)

# prepare feature names and types
X_types = {
    "type1": [f"type1_{x}" for x in range(1, 6)],
    "type2": [f"type2_{x}" for x in range(1, 6)],
}
X_names = X_types["type1"] + X_types["type2"]

# make df
data = pd.DataFrame(X)
data.columns = X_names
data["target"] = y

# create individual models
model_1 = PipelineCreator(problem_type="regression", apply_to="type1")
model_1.add("filter_columns", apply_to="*", keep="type1")
model_1.add("svm", C=[1, 2])

model_2 = PipelineCreator(problem_type="regression", apply_to="type2")
model_2.add("filter_columns", apply_to="*", keep="type2")
model_2.add("rf")

# Create the stacking model
model = PipelineCreator(problem_type="regression")
model.add(
    "stacking",
    estimators=[[("model_1", model_1), ("model_2", model_2)]],
    apply_to="*",
)

# run
scores, final = run_cross_validation(
    X=X_names,
    y="target",
    data=data,
    model=model,
    seed=200,
)

This gives me the following error:

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/pipeline.py", line 406, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/home/lsasse/julearn_issues/julearn/julearn/base/estimators.py", line 77, in fit
    self.model_.fit(Xt, y, **fit_params)
  File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/ensemble/_stacking.py", line 957, in fit
    return super().fit(X, y, sample_weight)
  File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/ensemble/_stacking.py", line 195, in fit
    names, all_estimators = self._validate_estimators()
  File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/ensemble/_base.py", line 303, in _validate_estimators
    raise ValueError(
ValueError: The estimator GridSearchCV should be a regressor.

I tried the identical example as a classification problem, based on the example provided in the new branch, and it works fine (including tuning the C parameter of the SVM).
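For completeness, here is a sketch of the classification counterpart I mean, modelled on the stacking example in the branch (the variable names simply mirror the regression code above, so the exact example in the branch may differ slightly):

from sklearn.datasets import make_classification
import pandas as pd

from julearn.api import run_cross_validation
from julearn.pipeline import PipelineCreator

# prepare data and feature types exactly as in the regression example
X, y = make_classification(n_features=10, n_samples=200)
X_types = {
    "type1": [f"type1_{x}" for x in range(1, 6)],
    "type2": [f"type2_{x}" for x in range(1, 6)],
}
X_names = X_types["type1"] + X_types["type2"]
data = pd.DataFrame(X, columns=X_names)
data["target"] = y

# same two base models, now as classifiers (C of the SVM is tuned)
model_1 = PipelineCreator(problem_type="classification", apply_to="type1")
model_1.add("filter_columns", apply_to="*", keep="type1")
model_1.add("svm", C=[1, 2])

model_2 = PipelineCreator(problem_type="classification", apply_to="type2")
model_2.add("filter_columns", apply_to="*", keep="type2")
model_2.add("rf")

# stack them; this classification version runs without the ValueError
model = PipelineCreator(problem_type="classification")
model.add(
    "stacking",
    estimators=[[("model_1", model_1), ("model_2", model_2)]],
    apply_to="*",
)

scores, final = run_cross_validation(
    X=X_names,
    X_types=X_types,
    y="target",
    data=data,
    model=model,
    seed=200,
    return_estimator="final",
)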

Expected behavior
The stacked regression pipeline should fit and cross-validate just like the stacked classification example does, including hyperparameter tuning of the individual models.

System (please complete the following information):

  • OS: Linux (conda environment, paths under /home/lsasse/miniconda3)
  • Version: Python 3.9.15; julearn installed in editable mode from the julearn_sk_pandas branch

Additional context

https://stackoverflow.com/questions/69269334/hyperparameter-tuning-for-stackingregressor-sklearn suggests to me that julearn needs to wrap the hyperparameters differently in the case of stacking: define the models inside the stacking regressor, and then put the stacking regressor itself into a GridSearchCV whose param grid specifies the hyperparameters for each model in the stacked regressor.
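To illustrate, here is a minimal plain-scikit-learn sketch of that approach (the estimator names and the parameter grid are only illustrative, not julearn internals): the base models go into the StackingRegressor untuned, and a single outer GridSearchCV routes the hyperparameters to them via the "<estimator_name>__<parameter>" convention.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_regression(n_features=10, n_samples=200, random_state=200)

# the base models live inside the stacking regressor, untuned
stacker = StackingRegressor(
    estimators=[("svm", SVR()), ("rf", RandomForestRegressor())]
)

# one outer GridSearchCV tunes the base models through the
# "<estimator_name>__<parameter>" naming convention
param_grid = {
    "svm__C": [1, 2],
    "rf__n_estimators": [50, 100],
}
search = GridSearchCV(stacker, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)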

What's more: even if I remove the hyperparameter tuning, the StackingRegressor does not seem to accept a scikit-learn Pipeline as an estimator either:

from sklearn.datasets import make_regression
import pandas as pd

from julearn.api import run_cross_validation
from julearn.pipeline import PipelineCreator
from julearn.utils import configure_logging

configure_logging(level="INFO")

# prepare data
X, y = make_regression(n_features=10, n_samples=200)

# prepare feature names and types
X_types = {
    "type1": [f"type1_{x}" for x in range(1, 6)],
    "type2": [f"type2_{x}" for x in range(1, 6)],
}
X_names = X_types["type1"] + X_types["type2"]

# make df
data = pd.DataFrame(X)
data.columns = X_names
data["target"] = y

# create individual models
model_1 = PipelineCreator(problem_type="regression", apply_to="type1")
model_1.add("filter_columns", apply_to="*", keep="type1")
model_1.add("svm")

model_2 = PipelineCreator(problem_type="regression", apply_to="type2")
model_2.add("filter_columns", apply_to="*", keep="type2")
model_2.add("rf")

# Create the stacking model
model = PipelineCreator(problem_type="regression")
model.add(
    "stacking",
    estimators=[[("model_1", model_1), ("model_2", model_2)]],
    apply_to="*",
)

# run
scores, final = run_cross_validation(
    X=X_names,
    y="target",
    data=data,
    model=model,
    seed=200,
)

which gives the error:

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/pipeline.py", line 406, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/home/lsasse/julearn_issues/julearn/julearn/base/estimators.py", line 77, in fit
    self.model_.fit(Xt, y, **fit_params)
  File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/ensemble/_stacking.py", line 957, in fit
    return super().fit(X, y, sample_weight)
  File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/ensemble/_stacking.py", line 195, in fit
    names, all_estimators = self._validate_estimators()
  File "/home/lsasse/miniconda3/envs/example_venv/lib/python3.9/site-packages/sklearn/ensemble/_base.py", line 303, in _validate_estimators
    raise ValueError(
ValueError: The estimator Pipeline should be a regressor.

@LeSasse: excellent bug report!!

We just fixed this. This code should work with the new PR.

from sklearn.datasets import make_regression
import pandas as pd

from julearn.api import run_cross_validation
from julearn.pipeline import PipelineCreator
from julearn.utils import configure_logging

configure_logging(level="INFO")

# prepare data
X, y = make_regression(n_features=10, n_samples=200)

# prepare feature names and types
X_types = {
    "type1": [f"type1_{x}" for x in range(1, 6)],
    "type2": [f"type2_{x}" for x in range(1, 6)],
}
X_names = X_types["type1"] + X_types["type2"]

# make df
data = pd.DataFrame(X)
data.columns = X_names
data["target"] = y

# create individual models
model_1 = PipelineCreator(problem_type="regression", apply_to="type1")
model_1.add("filter_columns", apply_to="*", keep="type1")
model_1.add("svm", C=[1, 2])

model_2 = PipelineCreator(problem_type="regression", apply_to="type2")
model_2.add("filter_columns", apply_to="*", keep="type2")
model_2.add("rf")

# Create the stacking model
model = PipelineCreator(problem_type="regression")
model.add(
    "stacking",
    estimators=[[("model_1", model_1), ("model_2", model_2)]],
    apply_to="*",
)

# run
scores, final = run_cross_validation(
    X=X_names,
    X_types=X_types,
    y="target",
    data=data,
    model=model,
    seed=200,
    return_estimator="final",
)
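
As a quick check once this runs (assuming scores is the usual julearn results DataFrame with the standard cross_validate columns):

# mean out-of-sample performance across the CV folds
print(scores["test_score"].mean())

# the final model, trained on all the data, is returned because of
# return_estimator="final"
print(final)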