Usage of scorer is incorrect
NicolasHug opened this issue · 1 comment
We use `scorer._score_func(y_train, predicted_train)` where `predicted_train = self._predict_binned(X_binned_train)`.

`_predict_binned` outputs classes for classification. This won't work for e.g. `scoring='neg_log_loss'`, which requires probabilities from `predict_proba()`.
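To illustrate the mismatch, here is a small sketch (not from the codebase) showing why feeding class predictions to a log-loss metric gives a meaningless score, while probability estimates behave as expected:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 1, 0])

# Hard class predictions, as a predict()-style method would return.
# log_loss interprets them as degenerate 0/1 probabilities, so every
# misclassified sample contributes -log(eps) after clipping, and the
# score blows up instead of measuring calibration.
y_pred_classes = np.array([0.0, 1.0, 0.0, 0.0])
loss_from_classes = log_loss(y_true, y_pred_classes, labels=[0, 1])

# Probability estimates for the positive class, as predict_proba()
# would return: the loss stays finite and informative.
y_pred_proba = np.array([0.1, 0.8, 0.4, 0.2])
loss_from_proba = log_loss(y_true, y_pred_proba, labels=[0, 1])

print(loss_from_classes, loss_from_proba)
```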
One solution is to get rid of `scoring` and check early stopping based on the loss.

If we want to keep `scoring`, I think we'll need to always use `scorer.__call__()` instead of `scorer._score_func()`, so that we don't reinvent the wheel. The signature of `scorer.__call__()` is `(estimator, X, y)`, so we'll need to pass `self`. I think this is doable. I see 2 options:
- We pass `X_binned_train` to `scorer.__call__()`, and `predict()` (and `predict_proba()`) has to figure out whether its input `X` is pre-binned or not, e.g. by checking that the dtype is `uint8`. This forces users to never use `uint8` arrays.
- We pass `X_train` instead of `X_binned_train`. This is slightly slower.
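The first option could be sketched roughly as follows. This is only an illustration under assumed names (`GradientBoostingSketch`, `_bin_data`, `_raw_predict_binned` are hypothetical, not the actual implementation):

```python
import numpy as np

# Pre-binned data is stored as uint8 bin indices (max 256 bins).
X_BINNED_DTYPE = np.uint8

class GradientBoostingSketch:
    def _bin_data(self, X):
        # Stand-in for the real binning step: map each feature value
        # to a bin index in [0, 255].
        return np.clip(X, 0, 255).astype(X_BINNED_DTYPE)

    def _raw_predict_binned(self, X_binned):
        # Stand-in for summing the predictors' outputs on binned data.
        return X_binned.sum(axis=1).astype(np.float64)

    def predict(self, X):
        if X.dtype == X_BINNED_DTYPE:
            # Caveat raised above: this dtype heuristic means users can
            # never pass genuine uint8 feature arrays.
            X_binned = X
        else:
            X_binned = self._bin_data(X)
        return self._raw_predict_binned(X_binned)
```

With this dispatch, `scorer.__call__(self, X_binned_train, y_train)` works because `predict()` recognizes the pre-binned input and skips the binning step.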
In both cases it's better to have a C-contiguous `X` for predictions: it's faster than with Fortran-ordered arrays. Here are 3 runs of `bench_predictors.py` from #62 (5e6 samples):
| Input        | Run 1  | Run 2  | Run 3  |
|--------------|--------|--------|--------|
| Pre-binned F | .5456s | .4914s | .5019s |
| Pre-binned C | .5287s | .4690s | .4724s |
| numerical F  | .7616s | .6745s | .6788s |
| numerical C  | .5716s | .5033s | .5099s |
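Since training typically prefers Fortran order (per-feature column access) while prediction prefers C order (per-sample row access), a cheap fix is to copy to C order once before predicting. A minimal illustration, not taken from the benchmark script:

```python
import numpy as np

# Fortran-ordered array, as might be used at fit time.
X_fit = np.asfortranarray(np.random.rand(1000, 8))

# ascontiguousarray returns a C-ordered copy, and is a no-op
# if the input is already C-contiguous.
X_predict = np.ascontiguousarray(X_fit)

print(X_fit.flags['F_CONTIGUOUS'], X_predict.flags['C_CONTIGUOUS'])
```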
> One solution is to get rid of `scoring` and check early stopping based on the loss.
Being able to enable early stopping based on the loss would be interesting and could be made the default behavior (that is, have `scoring=None` by default in the constructor).
However, I think it's worth keeping the possibility of plugging in a "business" metric that is not the loss: early stopping is a form of model selection, and it makes sense to use the business metric for that.
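The loss-based variant (`scoring=None`) could reduce to a convergence check like the following sketch. The names `should_stop`, `n_iter_no_change`, and `tol` are assumptions for illustration, not the actual API:

```python
def should_stop(loss_history, n_iter_no_change=5, tol=1e-7):
    """Return True if the loss has not improved by at least `tol`
    over the last `n_iter_no_change` iterations (lower loss is better)."""
    if len(loss_history) <= n_iter_no_change:
        return False
    reference = loss_history[-(n_iter_no_change + 1)]
    recent = loss_history[-n_iter_no_change:]
    # Stop when no recent loss beats the reference by at least tol.
    return all(loss >= reference - tol for loss in recent)
```

The same check generalizes to a pluggable scorer by flipping the comparison, since scorer values follow a higher-is-better convention.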