scikit-learn-contrib/category_encoders

Multidimensional/composite target encoding

sunishchal-recentive opened this issue · 4 comments

I would like to do target encoding on the composite of multiple columns, but the current functionality only allows a single column to be encoded.

For example: I have a column named product and another named color, and I'd like a unique target encoding value for each product+color combination. Currently, I can only have an encoding for each unique product and each unique color separately.

The workaround would be to concatenate the column values together and then target encode, but that is a bit clunky and leads to some unnecessary categorical features in my dataframe. Let me know if this is something worth raising a PR for.

The implementation I have in mind is an optional new argument called something like composite_cols (open to better naming suggestions). This argument would be a list of lists, where each inner list holds the names of the columns to be concatenated and each element of the outer list defines one composite column. If it is passed, the encoder converts the values to strings and concatenates them before encoding the result the same way as regular cols. The composite column can be named as the concatenation of all its component column names.

If we implemented it for TargetEncoder, we would also need it for all other encoders where each column is encoded independently (which is every encoder except hashing).
I think the library should focus on just encoding and not do this kind of feature engineering. My subjective opinion is that leaving concatenation to the user is the way to go. Is it really that clunky? It's just a line of code, isn't it? What exactly do you mean by "leads to some unnecessary categorical features"? Do you mean that if you concatenate product and color into productcolor, the encoder will also encode product and color (if you do not explicitly specify columns)? On that point I agree it's annoying.
Maybe we could leave the encoders unchanged and offer some preprocessing functions instead? There could be a preprocessing module with a function create_composite_columns(input_df: pd.DataFrame, composite_cols: List[List[str]]) -> pd.DataFrame that concatenates the columns and drops the individual columns before encoding.
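A minimal sketch of what such a helper could look like, assuming only the signature suggested above. This function is hypothetical and not part of category_encoders; the sep parameter, the "_" naming scheme, and the two-column minimum are additional assumptions.

```python
from typing import List

import pandas as pd


def create_composite_columns(
    input_df: pd.DataFrame,
    composite_cols: List[List[str]],
    sep: str = "_",  # assumption: separator for both values and column names
) -> pd.DataFrame:
    """Hypothetical helper: add one string column per group, drop the components."""
    out = input_df.copy()
    used = set()
    for group in composite_cols:
        if len(group) < 2:
            raise ValueError("each composite needs at least two columns")
        # Read from input_df so a column can appear in several groups.
        combined = input_df[group[0]].astype(str)
        for col in group[1:]:
            combined = combined + sep + input_df[col].astype(str)
        # The composite is named after its component columns.
        out[sep.join(group)] = combined
        used.update(group)
    # Drop every component column, keeping the remaining column order.
    return out.drop(columns=[c for c in input_df.columns if c in used])


# Toy data, invented for illustration.
df = pd.DataFrame({
    "product": ["shirt", "hat"],
    "color": ["red", "blue"],
    "size": ["M", "L"],
    "price": [20.0, 15.0],
})
result = create_composite_columns(df, [["product", "color"], ["product", "size"]])
print(result)
```

The returned frame could then be passed to any encoder as usual, with the composite column names listed in cols. The input frame is left unmodified because the helper works on a copy.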

Great! I like the preprocessing idea. I will scope it out and work on a PR for this in the next couple weeks.