scikit-learn-contrib/category_encoders

[question] Conditional encoding in a pipeline

flokde opened this issue · 1 comments

Hi there,

I've really been enjoying the package and have been using it together with sklearn pipelines. It is super intuitive and user friendly as it ties in so nicely with sklearn.

I am at a point where I would like to create pipelines dynamically, and I ran into a question: I want to conditionally encode my categorical features, e.g. all features with fewer than 50 unique values should be one-hot encoded and all others should be target encoded.

I currently do this the following way (which doesn't feel ideal): I create a pipeline and, once I know what the training set will be, I select the columns according to my condition and insert a one-hot encoding step and a target encoding step into my pipeline, each with its respective columns.

Is there a way to encode this behavior into the pipeline before knowing the dataset (basically the under/above 50 unique values from above)? Manually filtering the columns every time just doesn't feel like it is the best solution and seems messy.

Thank you so much and all the best,
flokde

> Is there a way to encode this behavior into the pipeline before knowing the dataset (basically the under/above 50 unique values from above)?

No, there can't be, since which columns get which encoder depends on the data.
I think the easiest solution to your problem is to first count the distinct values per column, decide which encoder each column should get, and then use those encoders in a pipeline:

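from category_encoders import utils as util  # assumed alias for category_encoders.utils
cat_cols = util.get_obj_cols(df)  # categorical columns of the training DataFrame df
ohe_cols = [c for c in cat_cols if df[c].nunique() < 50]   # low cardinality: one-hot encode
te_cols = [c for c in cat_cols if df[c].nunique() >= 50]   # high cardinality: target encode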