ogrisel/pygbm

Usage of scorer is incorrect

NicolasHug opened this issue · 1 comments

We use scorer._score_func(y_train, predicted_train) where predicted_train = self._predict_binned(X_binned_train).

predict_binned outputs classes for classification. This won't work for e.g. scoring='neg_log_loss' which requires probabilities from predict_proba().

One solution is to get rid of scoring and check early stopping based on the loss.

If we want to keep scoring, I think we'll need to always use scorer.__call__() instead of scorer._score_func(), so that we don't reinvent the wheel.

The signature of scorer.__call__() is estimator, X, y, so we'll need to pass self. I think this is doable. I see 2 options:

  • We pass X_binned_train to scorer.__call__() and predict() (and predict_proba) has to figure out whether its input X is pre-binned or not, e.g. by checking that the dtype is uint8. This forces users to never use uint8 arrays.
  • We pass X_train instead of X_binned_train. This is slightly slower.

In both cases it's better to have a C-contiguous X for predictions. It's faster than with Fortran arrays. Here are 3 runs of bench_predictors.py from #62 (5e6 samples)

Pre-binned F .5456s
Pre-binned C .5287s
numerical F .7616s
numerical C .5716s

Pre-binned F .4914s
Pre-binned C .4690s
numerical F .6745s
numerical C .5033s

Pre-binned F .5019s
Pre-binned C .4724s
numerical F .6788s
numerical C .5099s

One solution is to get rid of scoring and check early stopping based on the loss.

Being able to enable early stopping based on the loss would be interesting and could be made the default behavior (that is, have scoring=None by default in the constructor).

However, I think it's interesting to make it possible to use a pluggable "business" metric that is not the loss for early stopping as early stopping is a form of model selection and it makes sense to use the "business" metric for that.