scikit-learn-contrib/category_encoders

Feature Request: Count-Based Target Encoder (Dracula)?

bking124 opened this issue · 1 comment

I recently stumbled upon a categorical encoding idea dubbed "Distributed Robust Algorithm for Count-based Learning" (aka Dracula) described in this Microsoft blog as well as this talk. It seems like it mixes ideas of CountEncoder and TargetEncoder. Has anybody heard of this approach before and has there been thought of introducing such an encoder into the package? I'm interested to compare this approach with the typical TargetEncoder.

Thanks for the wonderful package!

Hi @bking124

I haven't heard of the approach before. Searching "Dracula Encoder" or "CTR encoder" (as mentioned in the talk) also doesn't yield much. Since the talk and blog post are already 8 years old and the approach hasn't gained much traction since, I'd be surprised if it yields great results.
On the other hand, we could include it in the package. I think it should be rather straightforward to implement.
From what I understood, the encoded value is calculated as:

  1. calculate the counts for each label, e.g. `df.groupby([col, label]).size()`. This can only be done for the top N categories; all remaining categories are mapped to a single "rest" category
  2. use as the encoded value for a category x: counts[x, target=0], counts[x, target=1], ..., log-odds, flag_is_rest (see the sketch below this list)
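Roughly, in pandas-only code, something like the following (this is just a sketch of my reading of the blog post; the function name `count_encode`, the `__rest__` label, and the `prior` term are placeholders I made up, and it assumes a binary 0/1 target):

```python
import numpy as np
import pandas as pd

def count_encode(df, col, target, top_n=20, prior=1.0):
    """Per-class counts + log-odds + is-rest flag for a single column (sketch)."""
    # Step 1: keep only the top-N categories, map everything else to "__rest__"
    top = df[col].value_counts().nlargest(top_n).index
    values = df[col].where(df[col].isin(top), "__rest__")

    # Counts of each target class per category (assumes a binary 0/1 target)
    counts = pd.crosstab(values, df[target])

    # Step 2: log-odds with a small prior so empty cells don't blow up
    log_odds = np.log((counts[1] + prior) / (counts[0] + prior))
    out = counts.add_prefix("count_class_")
    out["log_odds"] = log_odds
    out["is_rest"] = (out.index == "__rest__").astype(int)

    # Map every row of df onto its category's encoding
    encoded = out.reindex(values.values)
    encoded.index = df.index
    return encoded

df = pd.DataFrame({"city": ["a", "a", "b", "c", "a", "b"],
                   "y":    [1,   0,   1,   0,   1,   1]})
print(count_encode(df, "city", "y", top_n=2))
```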

I'm not quite sure how to handle the regression case. Probably we'd need some binning of the target variable there?
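For example, something along these lines (purely an assumption on my part, not something the blog or talk spells out): discretise the target into quantile bins and then count per bin exactly as in the classification case.

```python
import pandas as pd

# Hypothetical handling of a continuous target: bin it into quantiles first
y = pd.Series([3.2, 7.5, 1.1, 9.8, 4.4, 6.0, 2.3, 8.1])
y_binned = pd.qcut(y, q=4, labels=False)  # 4 roughly equal-sized bins -> codes 0..3
# `y_binned` could then serve as the `target` column in the counting sketch above
```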
Also, small categories might result in overfitting if the classifier basically ignores the counts and just uses the log odds (which it will). This might be a potential issue (just like in target encoding with too little regularization).
In fact, this is pretty much what you'd get by encoding a variable with both CountEncoder and TargetEncoder (with no regularization).
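So for a first comparison you could already stack the two existing encoders side by side, e.g. (TargetEncoder with very little smoothing to mimic the unregularized case):

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"city": ["a", "a", "b", "c", "a", "b"],
                   "y":    [1,   0,   1,   0,   1,   1]})

count_enc = ce.CountEncoder(cols=["city"])
target_enc = ce.TargetEncoder(cols=["city"], smoothing=1e-3)

# Concatenate the count features and the (barely smoothed) target-mean features
encoded = pd.concat(
    [count_enc.fit_transform(df[["city"]], df["y"]).add_suffix("_count"),
     target_enc.fit_transform(df[["city"]], df["y"]).add_suffix("_target")],
    axis=1,
)
print(encoded)
```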