scikit-learn-contrib/category_encoders

Unique levels, smoothing, and QuantileEncoder

bmreiniger opened this issue · 16 comments

The test ./tests/test_encoders.py::TestEncoders::test_unique_column_is_not_predictive fails for QuantileEncoder. That new supervised encoder wasn't added to this test.

My cursory understanding is that the other supervised encoders smooth things so that unique levels just get encoded with the prior. Is that really desired in general? It seems like a clunky exception, but if it is desired, can/should QuantileEncoder be adapted to do the same (@cmougan)?

Discovered while trying to refactor the test to use supervised-encoder tagging; see #326.

Is that really desired in general?

That's a good question. I've read through the target-encoder paper and there is no hard rule saying single-value categories should be assigned the prior. This was also raised by another user in #275
Usually the algorithms build some weighted average of the category mean and the prior, with the weight determined by the size of the category and a regularisation parameter. For a column whose levels each occur only once, we'd need to ensure that the regularisation parameter puts enough weight on the prior to avoid overfitting. However, the default regularisation parameters we have set in the target encoder are just too low.
In my opinion, the better option would be to change the defaults there. However, this might break some users' code if they rely on the defaults staying the same. The proper way would probably be to introduce future warnings first.
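
To illustrate the weighting described above, here is a minimal sketch of the sigmoid shrinkage from the target-encoder paper (function and parameter names are illustrative, not the library's API):

import numpy as np

def shrunk_encoding(category_mean, prior, n, k=1.0, f=1.0):
    # lam -> 1 for large levels (trust the level mean),
    # lam -> 0 for small levels (fall back to the prior).
    # k (min_samples_leaf) shifts the transition, f (smoothing) flattens it.
    lam = 1.0 / (1.0 + np.exp(-(n - k) / f))
    return lam * category_mean + (1.0 - lam) * prior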

Hi @bmreiniger! Long time no see.

About unique levels of a category: it's something that depends on the regularization.

There are several types of regularization: leave-one-out, exponential smoothing, Gaussian noise...

Quantile Encoder uses m-estimate smoothing (also known as additive smoothing).
It's the most basic one (and for me the most intuitive). Unique levels of a category get regularized by the m parameter and the size of the level (in this case = 1).
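
As a rough sketch of m-estimate (additive) smoothing, where m acts as a pseudo-count of prior observations (the statistic could be a mean or a quantile; the names are illustrative):

def m_estimate(category_stat, prior_stat, n, m=1.0):
    # n real observations of the level's statistic plus m "virtual"
    # observations of the prior; a level seen once (n=1) with m=1 is
    # pulled halfway towards the prior.
    return (n * category_stat + m * prior_stat) / (n + m)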

  • Which regularization technique is best? I have never seen a research study benchmarking them.
  • Should we encode unique elements with the prior? I don't have anything in favor nor anything against it. The safest way to decide would be to benchmark regularization techniques in scenarios where single-observation levels of a category are present.

Eventually, in my ideal world:

The user should be able to choose the desired regularizer as a parameter.
This would actually unify some of the current encoding methods: Target Encoder, M-estimate, CatBoost, LeaveOneOut.

To me, the difference between a categorical level with just one observation and one with two is not so big; I wouldn't much trust the two-observation level either. So I would prefer not to have such a discontinuous treatment, especially hard-coded and out of reach of users. As a way of letting users control it, maybe a parameter for minimum level size (what I originally thought min_samples_leaf was supposed to do) would be appropriate.
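
For concreteness, a sketch of what such a (hypothetical) minimum-level-size parameter could mean; nothing like this exists in the library, and the names are made up:

def encode_level(category_mean, prior, n, min_level_size=10):
    # Hypothetical user-controlled rule: levels observed fewer than
    # min_level_size times fall back to the prior instead of their own mean.
    return category_mean if n >= min_level_size else prior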

I'd happily be in favor of higher regularization values by default, with a future warning for a few releases if that's required.

I like Carlos's idea of generalizing regularization methods across many of the supervised encoders, for the long term. I worry about exploding the number of parameters, though; maybe something like a separate regularizer class that gets passed as a parameter? That would basically leave just target, WOE, and quantile as the encoders, with m-estimate, minimum level size, CatBoost, LOO, James-Stein, and GLMM as regularizers for them?
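
Purely as a sketch of that idea (none of these classes or parameters exist in category_encoders; the interface is invented for illustration):

class MEstimateRegularizer:
    """Illustrative regularizer object that an encoder could accept as a parameter."""

    def __init__(self, m=1.0):
        self.m = m

    def blend(self, category_stat, prior_stat, n):
        # additive smoothing: m pseudo-observations of the prior
        return (n * category_stat + self.m * prior_stat) / (n + self.m)

# hypothetical usage: TargetEncoder(regularizer=MEstimateRegularizer(m=10))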

To the best of my knowledge, benchmarking regularization in target encodings has not been done. It's actually a non-trivial experiment. I don't have a good answer on how to handle levels with just one observation besides common regularization.

Also, from an error and practical perspective: in a healthy dataset, if a level has just one instance in train, the chances of seeing it in test should be low. How much does this affect the error? Or do we care about how the algorithm is constructed? Open question.

For default hyperparameters, I am doing some experiments for a paper; let me see if I get some results that we can use. Still, further experimentation is needed (it might lead to a research paper).

For unifying supervised encoders, it might be good to sketch some diagrams to visualize the merge. A release like that might affect many users, and a fraction of users won't understand the merge. @PaulWestenthanner what do you think?

@bmreiniger, in a recent paper (https://arxiv.org/pdf/2201.11358.pdf) we studied the impact of regularization in target encoding (see Figure 2).

I was surprised to see that, for this particular case, regularization does not improve model performance, or the improvement is very minimal (other datasets will differ). This is an example where high default regularization won't help the user.

From a technical point of view I agree with @bmreiniger that making a hard cut between once-observed labels and twice (or more often) observed labels does not make much sense. So we should increase the default values for regularisation.

From a release point of view, I'd first add a future warning to all encoders whose default parameters will change, and only actually change them in the release after that. We should also introduce a changelog / release overview / what's new page.
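
A minimal sketch of that deprecation pattern, assuming a sentinel default to detect that the user did not set the parameter explicitly (the class name is illustrative):

import warnings

_DEFAULT = object()  # sentinel: distinguishes "left at default" from an explicit value

class SomeSupervisedEncoder:
    def __init__(self, smoothing=_DEFAULT):
        if smoothing is _DEFAULT:
            warnings.warn(
                "The default of `smoothing` will change in a future release; "
                "pass a value explicitly to keep the current behaviour.",
                FutureWarning,
            )
            smoothing = 1.0  # current default, unchanged for now
        self.smoothing = smoothing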

About unifying supervised encoders: I agree we'd need a sketch first and only then decide. I think from a user's point of view it is more explicit to have all target encoders separate rather than just having one with a lot of functionality. If there are many keyword arguments specific to a particular encoder it makes more sense to keep them separate. From a coding point of view it's obviously nice to unify things. However that change would be rather big. Maybe for some version 3.x?

Hi guys,

I'm planning to do a release shortly and add the FutureWarnings for those who use the default parameters at the moment. As for new default parameters I just did some analysis:

[Plot: target_encoder_smoothing — grid of lambda curves for different k and f]

This plot shows the value of lambda as given in the paper, lambda(n) = 1 / (1 + exp(-(n - k) / f)), plotted for n from 1 to 100 for different k (= min_samples_leaf) and f (= smoothing). Based on this broad parameter range, I've plotted one combination in greater detail: k=20, f=10.

[Plot: target_encoder_smoothing_k20f10 — lambda curve for k=20, f=10]

I would suggest k=20 and f=10 as the new default parameters.
This gives:

  • roughly 87% global mean for n=1
  • 75% global mean for n=10
  • 50% global mean for n=20
  • 75% specific mean for n=30
  • almost all specific mean for n>50

Obviously these are hyper-parameters that need to be tuned for the specific problem at hand, but I think the weights above are a good default.
Does anyone disagree?

For completeness' sake, I'll add the code:

from matplotlib import pyplot as plt
%matplotlib inline
import numpy as np

def lambda_coef(n, min_samples_leaf, smoothing):
    # weight of the category-specific mean (vs. the prior) for a level with n samples
    return 1 / (1 + np.exp(-(n - min_samples_leaf) / smoothing))

min_samples = [1, 3, 5, 8, 13, 21, 34, 55]       # candidate k values
smoothing_params = [2**x for x in range(0, 7)]   # candidate f values: 1..64

# one subplot per (k, f) combination, lambda plotted against the level size n
fig, axes = plt.subplots(len(min_samples), len(smoothing_params), figsize=(25, 25))
x_axis = range(1, 100)
for idx1, min_s in enumerate(min_samples):
    for idx2, smoothing in enumerate(smoothing_params):
        y_axis = [lambda_coef(n, min_s, smoothing) for n in x_axis]
        axes[idx1, idx2].plot(x_axis, y_axis)
        axes[idx1, idx2].set_title(f"k={min_s};f={smoothing}")
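
For reference, the weights behind the numbers above can be checked directly by reusing lambda_coef from the snippet (lam is the weight on the category-specific mean):

for n in [1, 10, 20, 30, 50]:
    lam = lambda_coef(n, min_samples_leaf=20, smoothing=10)
    print(f"n={n}: {lam:.2f} specific mean / {1 - lam:.2f} global mean")
# n=1: 0.13/0.87, n=10: 0.27/0.73, n=20: 0.50/0.50, n=30: 0.73/0.27, n=50: 0.95/0.05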

Hi @PaulWestenthanner!

Very interesting :)

Shouldn't this depend on the estimator used?

What happens if you apply this approach to the other regularizers? M-estimate and Gaussian Noise?

These default hyperparameters won't apply to all encoding methods, will they?

Hi,

yes, the best values will depend on the estimator used. What we're discussing in this issue is that the current default values for smoothing and min_samples_leaf are just chosen very poorly.
The workflow should always include optimising the hyper-parameters of the encoders, but the encoder should start with sensible defaults anyway. This is not the case at the moment, as the analysis above shows.
This analysis is for the TargetEncoder only. A similar analysis should be done for M-estimate and the other variations of target encoding.
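
For example, a minimal sketch of tuning the TargetEncoder hyper-parameters together with a downstream model (the grid and the model are placeholders, not recommendations):

import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("encoder", ce.TargetEncoder()),
    ("model", LogisticRegression(max_iter=1000)),
])

# search over the two smoothing-related hyper-parameters discussed here
param_grid = {
    "encoder__min_samples_leaf": [1, 10, 20, 50],
    "encoder__smoothing": [1, 10, 20],
}
search = GridSearchCV(pipe, param_grid, cv=5)
# search.fit(X_train, y_train)  # X_train with categorical columns, y_train the target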

Hi @PaulWestenthanner,

If the regularization hyperparameter depends on the estimator and on the dataset used, why is this analysis supposed to improve the default hyperparameters for the Target Encoder?
What am I missing here?

That's because the new defaults might be bad for some encoders or estimators, but the current ones are bad for pretty much all of them.
The current behaviour is that if a category occurs only twice (or, with a correct implementation, only once), the encoder leans heavily towards the category average. This will lead to overfitting. Also, if people start hyper-parameter optimisation they should start from sensible defaults rather than defaults that are probably pretty bad for any model.
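
Concretely, with the current defaults (k=1, f=1) the weight on the category-specific mean is already large for tiny levels (same lambda as in the plotting code above):

import numpy as np

def lambda_coef(n, min_samples_leaf=1, smoothing=1):
    return 1 / (1 + np.exp(-(n - min_samples_leaf) / smoothing))

for n in [1, 2, 3, 5]:
    print(n, round(lambda_coef(n), 2))
# 1 -> 0.5, 2 -> 0.73, 3 -> 0.88, 5 -> 0.98  (weight on the category mean)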

I understand the need for "optimal" default hyperparameters. But how do you choose an optimal default hyperparameter when it will depend on both the data and the estimator?

For example, you suggest k=20 and f=10 as new default parameters; why not k=10 and f=20?
How does your previous experiment help to assess this decision? I think I am missing something here that apparently seems obvious.

I chose it because I think the resulting values of lambda make sense:

I would suggest k=20 and f=10 as the new default parameters.
This gives:

  • roughly 87% global mean for n=1
  • 75% global mean for n=10
  • 50% global mean for n=20
  • 75% specific mean for n=30
  • almost all specific mean for n>50

This is basically just my gut feeling saying that 50 data points is enough to trust the label.
If you compare it to k=10, f=20 as you suggest you get a much flatter S-shaped curve with these stats:

  • 40% specific mean for n=1
  • roughly 62% specific mean for n=20
  • 70% specific mean for n=30
  • 90% specific mean for n=50

This is much more likely to overfit, since a unique level is weighted with 40% of its specific mean; and on the other end of the spectrum, if a level occurs 50 times I'd be quite happy to give it more than 90%. So that's why I like k=20, f=10 better than k=10, f=20.

For reference, the current defaults k=1, f=1 give these statistics:

  • 50% specific mean for n=1
  • 90% specific mean for n=3
  • 99% specific mean for n>7

I hope my argument makes sense to you, even though it is not based on rigorous science but is rather a heuristic for what seems sensible and provides a good starting point for hyper-parameter optimisation.

I see. Thanks :)

I have been wondering for a while whether there is a more methodological way to estimate default hyperparameters. The problem is also relevant in the scikit-learn library.