scikit-learn-contrib/category_encoders

[FEATURE] K-fold Target Encoder

nilslacroix opened this issue · 8 comments

My proposal is to implement a k-fold parameter for all target encoders, which trains an encoder in CV fashion on every fold instead of on the whole dataset. This has huge benefits in terms of reducing overfitting and improving downstream model performance.
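For illustration, here is a minimal sketch of the idea (nothing like kfold_target_encode exists in the library; the helper and its signature are made up): each row is encoded by a TargetEncoder fitted only on the other folds, so a row's own target never leaks into its encoding.

import pandas as pd
from sklearn.model_selection import KFold
from category_encoders import TargetEncoder

def kfold_target_encode(X, y, cols, n_splits=5):
    """Out-of-fold target encoding sketch (hypothetical helper).

    Rows in fold k are transformed by an encoder fitted only on the
    remaining folds. Assumes X is a DataFrame with a unique index and
    y is a Series aligned with X.
    """
    parts = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(X):
        enc = TargetEncoder(cols=cols)
        enc.fit(X.iloc[train_idx], y.iloc[train_idx])
        parts.append(enc.transform(X.iloc[val_idx]))
    return pd.concat(parts).sort_index()  # restore the original row order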

Hi @nilslacroix
I don't think this works, since there is no measure of "performance" in the encoder itself. You can only measure performance if you fit a model afterwards. Also, fixing a model beforehand and then doing CV on the encoder does not make sense, since there might be a better combination of encoder and model hyperparameters. So the only way is to cross-validate a sklearn pipeline over both encoder and model parameters. This should be possible already, right?

May I point you to this article: https://towardsdatascience.com/benchmarking-categorical-encoders-9c322bd77ee8? It shows much better encoding performance in his test cases when the encoders are trained on folds beforehand :)

Hi,
thanks for pointing that out. I wasn't aware of it.
But the single-validation scheme is precisely what I described above: packing the encoder together with a model into a sklearn pipeline and fitting it using cross-validation, right? I see that double validation (where the encoder is additionally fitted out-of-fold within each training fold) is missing. Maybe @DenisVorotyntsev himself could also shed some light on this? Maybe I'm getting things wrong here.

I do not think that is how pipelines work? My understanding is that all the transformations are done just once on the train/test set, and the estimator is only called in the last step. If cross-validation is used, I would assume that the whole dataset is split at the last step of the pipeline, not before the transformations happen. Thus the encoder would be fitted just once on the whole training set, transform it, and then transform the test set.

I haven't tried it myself, but the documentation suggests it is possible to have multiple estimators (remember, encoders that require fitting are estimators too) in one pipeline.
https://scikit-learn.org/stable/modules/compose.html
In their example they chain a PCA dimensionality reduction with a model and then run a GridSearchCV over the parameters of both. This is exactly what we want, right?
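Something along these lines, I mean (a quick sketch of the pattern from the linked docs, using PCA and logistic regression on iris; untested here):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('reduce_dim', PCA()),
                 ('clf', LogisticRegression(max_iter=1000))])

# Parameters of both steps are tuned jointly via the step-name__param convention.
param_grid = {'reduce_dim__n_components': [2, 3],
              'clf__C': [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)  # each candidate refits the whole pipeline on every fold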

Hmm, this might be it. If you pass any parameter of the encoder to the grid search, and the grid search is validated over different folds, this would naturally lead to n encoders fitted on n folds. But I am still not sure whether this is how it works internally.

I think this is pretty much how it works. I just did a small example:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from category_encoders import TargetEncoder
from sklearn.model_selection import GridSearchCV

# Chain the encoder and the classifier in one pipeline.
estimators = [('encoder', TargetEncoder()), ('clf', SVC())]
pipe = Pipeline(estimators)

y = [0, 0, 1, 0, 1, 1, 1] * 2
train_df = pd.DataFrame({'cat_feat': ['a', 'a', 'd', 'd', 'd', 'f', 'f'] * 2,
                         'num_feat': [0, 2, 3, 1, 4.4, 2.3, 9.5] * 2})

# Tune encoder and classifier parameters jointly; each candidate is
# cross-validated, so the encoder is refit on every training fold.
param_grid = dict(encoder__smoothing=[1.0, 100.0, 10000.0],
                  clf__C=[0.1, 10])
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=2, verbose=2)
grid_search.fit(train_df, y)

Sorry for the strange input; I just took it from another issue I've been working on.

This gives exactly the per-fold fits we would expect: the whole pipeline, encoder included, is cloned and refit on every training split.

Fitting 2 folds for each of 6 candidates, totalling 12 fits
[CV] END .................clf__C=0.1, encoder__smoothing=1.0; total time=   0.0s
[CV] END .................clf__C=0.1, encoder__smoothing=1.0; total time=   0.0s
[CV] END ...............clf__C=0.1, encoder__smoothing=100.0; total time=   0.0s
[CV] END ...............clf__C=0.1, encoder__smoothing=100.0; total time=   0.0s
[CV] END .............clf__C=0.1, encoder__smoothing=10000.0; total time=   0.0s
[CV] END .............clf__C=0.1, encoder__smoothing=10000.0; total time=   0.0s
[CV] END ..................clf__C=10, encoder__smoothing=1.0; total time=   0.0s
[CV] END ..................clf__C=10, encoder__smoothing=1.0; total time=   0.0s
[CV] END ................clf__C=10, encoder__smoothing=100.0; total time=   0.0s
[CV] END ................clf__C=10, encoder__smoothing=100.0; total time=   0.0s
[CV] END ..............clf__C=10, encoder__smoothing=10000.0; total time=   0.0s
[CV] END ..............clf__C=10, encoder__smoothing=10000.0; total time=   0.0s

Process finished with exit code 0
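As a quick sanity check (building on the snippet above): with the default refit=True, GridSearchCV refits the whole pipeline on the full training data using the winning parameters, so you can inspect the encoder it actually kept:

best_pipe = grid_search.best_estimator_
print(grid_search.best_params_)
print(best_pipe.named_steps['encoder'].smoothing)  # winning smoothing value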

Can we close the issue then?

I think so, yes.