nan score for StackingClassifier due to 'scoring' argument in cross_val_score
kemaldahha opened this issue · 3 comments
Hi, I'm trying to run the code below (Example 1 from the StackingClassifier documentation):
```python
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier
import numpy as np
import warnings

warnings.simplefilter('ignore')

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
                          meta_classifier=lr)

print('3-fold cross validation:\n')

for clf, label in zip([clf1, clf2, clf3, sclf],
                      ['KNN',
                       'Random Forest',
                       'Naive Bayes',
                       'StackingClassifier']):
    scores = model_selection.cross_val_score(clf, X, y,
                                             cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), label))
```
I get the following output:
```
3-fold cross validation:

Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.95 (+/- 0.01) [Random Forest]
Accuracy: 0.91 (+/- 0.02) [Naive Bayes]
Accuracy: nan (+/- nan) [StackingClassifier]
```
The expected output is that the score for StackingClassifier should be a number like:
```
3-fold cross validation:

Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.95 (+/- 0.01) [Random Forest]
Accuracy: 0.91 (+/- 0.02) [Naive Bayes]
Accuracy: 0.95 (+/- 0.02) [StackingClassifier]
```
When I let the warning print (by commenting out `warnings.simplefilter('ignore')`), I get the output below, truncated since the warning is repeated several times:
```
3-fold cross validation:

Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.95 (+/- 0.01) [Random Forest]
Accuracy: 0.91 (+/- 0.02) [Naive Bayes]
c:\projects\machine-learning-matt-harrison\env\lib\site-packages\sklearn\model_selection\_validation.py:842: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
  File "c:\projects\machine-learning-matt-harrison\env\lib\site-packages\sklearn\metrics\_scorer.py", line 136, in __call__
    score = scorer._score(
  File "c:\projects\machine-learning-matt-harrison\env\lib\site-packages\sklearn\metrics\_scorer.py", line 353, in _score
    y_pred = method_caller(estimator, "predict", X)
  File "c:\projects\machine-learning-matt-harrison\env\lib\site-packages\sklearn\metrics\_scorer.py", line 86, in _cached_call
    result, _ = _get_response_values(
  File "c:\projects\machine-learning-matt-harrison\env\lib\site-packages\sklearn\utils\_response.py", line 74, in _get_response_values
    classes = estimator.classes_
AttributeError: 'StackingClassifier' object has no attribute 'classes_'
```
The problem seems to be related to the `scoring` argument in `model_selection.cross_val_score(clf, X, y, cv=3, scoring='accuracy')`. If I remove that argument, the default scoring is used (accuracy, I think), and I get the expected output, which matches the example in the documentation:
```
3-fold cross validation:

Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.95 (+/- 0.01) [Random Forest]
Accuracy: 0.91 (+/- 0.02) [Naive Bayes]
Accuracy: 0.95 (+/- 0.02) [StackingClassifier]
```
However, I would like to be able to use other scoring metrics as well (e.g. `roc_auc`), but then I have to provide the argument explicitly, and I get the nan score again for StackingClassifier.
I already checked issues #423 and #426, which mention a similar warning/error (`AttributeError: 'StackingClassifier' object has no attribute 'classes_'`), but I couldn't figure it out based on those issues.
I am using:
- Python 3.10.0
- scikit-learn==1.3.0
- mlxtend==0.22.0
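As a stopgap until a fix lands, one option is a thin wrapper that re-exposes `classes_` after fitting, which is all the 1.3.0 scorers need. This is only a sketch: `ClassesWrapper` is a hypothetical helper, and it is demonstrated here around a plain scikit-learn classifier so the snippet runs without mlxtend installed; the same wrapper would go around the mlxtend `StackingClassifier` instance.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB


class ClassesWrapper(BaseEstimator, ClassifierMixin):
    """Hypothetical helper: delegates to an inner estimator and sets
    classes_ after fit so scikit-learn's scorers can find it."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator_ = clone(self.estimator).fit(X, y)
        self.classes_ = np.unique(y)  # the attribute the scorer looks up
        return self

    def predict(self, X):
        return self.estimator_.predict(X)

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X)


X, y = load_iris(return_X_y=True)
scores = cross_val_score(ClassesWrapper(GaussianNB()), X, y,
                         cv=3, scoring="accuracy")
print(scores.mean())
```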
Thanks for the note! I can confirm having this issue in sklearn 1.3.0 as well (but not in 1.2.2). I just submitted a PR via #1060 to fix it.
I came across this lecture by @rasbt. Based on his explanation, a StackingClassifier was later included in sklearn itself. I adjusted the code to use the sklearn version of StackingClassifier:
```python
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
# from mlxtend.classifier import StackingClassifier
import numpy as np
import warnings

warnings.simplefilter('ignore')

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()

estimators = [("clf1", clf1),
              ("clf2", clf2),
              ("clf3", clf3)]

lr = LogisticRegression()
sclf = StackingClassifier(estimators=estimators,
                          final_estimator=lr)

print('3-fold cross validation:\n')

for clf, label in zip([clf1, clf2, clf3, sclf],
                      ['KNN',
                       'Random Forest',
                       'Naive Bayes',
                       'StackingClassifier']):
    scores = model_selection.cross_val_score(clf, X, y, cv=3, scoring="accuracy")
    print("Accuracy: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), label))
```
Now I do get output more in line with what I expect, though not exactly the same as in the mlxtend StackingClassifier documentation (Example 1):
```
3-fold cross validation:

Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.95 (+/- 0.01) [Random Forest]
Accuracy: 0.91 (+/- 0.02) [Naive Bayes]
Accuracy: 0.93 (+/- 0.02) [StackingClassifier]
```
Perhaps sklearn's StackingClassifier implementation differs from mlxtend's.
I am wondering whether we should still use mlxtend's StackingClassifier, or whether it is deprecated and we should use sklearn's implementation instead?
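One plausible reason for the 0.95 vs 0.93 gap is differing defaults rather than a bug: sklearn's `StackingClassifier` trains the final estimator on out-of-fold predictions (internal `cv=5`) and prefers `predict_proba` via `stack_method='auto'`, while mlxtend's non-CV `StackingClassifier` by default stacks plain class-label predictions from base models fit on the full training set (`use_probas=False`); mlxtend's CV-based counterpart is `StackingCVClassifier`. A sketch nudging sklearn's version toward label-based stacking (the choice of two base estimators here is just to keep it small):

```python
from sklearn import datasets
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

estimators = [("knn", KNeighborsClassifier(n_neighbors=1)),
              ("gnb", GaussianNB())]

# stack_method="predict" feeds class labels (not probabilities) to the
# final estimator, closer to mlxtend's use_probas=False default; the
# internal cross-validated training of the final estimator still differs.
sclf = StackingClassifier(estimators=estimators,
                          final_estimator=LogisticRegression(),
                          stack_method="predict")

scores = cross_val_score(sclf, X, y, cv=3, scoring="accuracy")
print("%.2f" % scores.mean())
```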
> Thanks for the note! I can confirm having this issue in sklearn 1.3.0 as well (but not in 1.2.2). I just submitted a PR via #1060 to fix it
Thanks for the reply. I posted my second comment before I read your reply, apologies.