scikit-learn-contrib/category_encoders

TargetEncoder instance cannot be used on dataframes with a different number of columns, even though the number of columns has nothing to do with the encoder

jconwell opened this issue · 8 comments

Expected Behavior

I should be able to encode my features, train my model, and then persist both the encoders used to create the features and the model itself, so they can be run against production data to generate classifications.

During production classification, the encoders and model are re-hydrated, the incoming data is loaded and run through the different encoders to generate my feature space, then run against the model to generate classification results.

Actual Behavior

During production classification, my incoming dataframe has fewer columns than the dataframe that was used when the encoders were created, but the columns to be encoded do exist in the incoming dataframe.

When I try to run the encoders on the dataframe I get the exception:

  • raise ValueError('Unexpected input dimension %d, expected %d' % (X.shape[1], self._dim,))

The code for this check is here:

if X.shape[1] != self._dim:

This error check is not needed, as nothing about the encoder depends on the number of columns in the dataframe. As long as the column I passed in when I fit the encoder exists in the dataframe I want to run transform on, it shouldn't matter how many columns that dataframe has.

I'm confused how anyone uses this encoder, or any encoder with the same error check, in a production environment. Say I have a dataframe with 10 columns, where each column will get run through the TargetEncoder, but I want to preserve the original columns that get encoded. The first encoder will get self._dim set to 11, the second encoder will get self._dim set to 12, and so on. Now I have a set of persisted encoders, each one with a hard-coded check for a different number of columns, even though the number of columns in the dataframe doesn't matter to the encoder.
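A rough sketch of that fitting pattern, just to make it concrete (the column names and data are made up; the point is that each fit sees the full, growing dataframe, so each encoder records a different width):

import numpy as np
import pandas as pd
import category_encoders as ce

rng = np.random.default_rng(0)
df = pd.DataFrame({f"cat_{i}": rng.choice(["a", "b", "c"], size=50) for i in range(10)})
y = rng.integers(0, 2, size=50)

encoders = {}
for col in list(df.columns):
    enc = ce.TargetEncoder(cols=[col])
    # fit sees every column currently in df, so self._dim grows by one each pass
    df[col + "_te"] = enc.fit_transform(df, y)[col]
    encoders[col] = enc  # persisted later, each expecting a different column count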

Steps to Reproduce the Problem

  1. create a dataframe with 10 columns (df1), including a column named "col_to_target_encode"
  2. run TargetEncoder on column "col_to_target_encode"
  3. pickle TargetEncoder
  4. create a dataframe with 9 columns (df2), including a column named "col_to_target_encode" and populated from the same distinct set of values from df1.col_to_target_encode
  5. load pickled TargetEncoder
  6. run TargetEncoder on column "col_to_target_encode" in df2
  7. An exception is raised "Unexpected input dimension 9, expected 10"
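A minimal repro sketch of the steps above (the sizes, column names, and data are illustrative):

import pickle
import numpy as np
import pandas as pd
import category_encoders as ce

rng = np.random.default_rng(0)
n = 100

# df1: 10 columns, one of which is the column to target-encode
df1 = pd.DataFrame({f"extra_{i}": rng.normal(size=n) for i in range(9)})
df1["col_to_target_encode"] = rng.choice(["a", "b", "c"], size=n)
y = rng.integers(0, 2, size=n)

encoder = ce.TargetEncoder(cols=["col_to_target_encode"])
encoder.fit(df1, y)
blob = pickle.dumps(encoder)

# df2: 9 columns, same categorical column with the same set of values
df2 = df1.drop(columns=["extra_0"])

restored = pickle.loads(blob)
restored.transform(df2)  # ValueError: Unexpected input dimension 9, expected 10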

Specifications

  • Version: 2.4.0
  • Platform: OSX
  • Subsystem: python 3.7

I assume that this error check is in other encoders as well, not just the TargetEncoder.

Hi @jconwell

I understand your problem, but I think this is a very reasonable design choice:
In sklearn the fit function of all machine learning models works on the whole dataframe. So if you were to keep unencoded columns after the categorical encoding, you would need to drop them manually anyway before fitting a model. Since the model will also check dimensions, it makes sense to also check them in the encoder. The idea of this project is to be as aligned with sklearn as possible and to follow its conventions.
So if you'd like to have a dataframe with both encoded and unencoded data the easiest would probably be to encode the data and then use pd.concat to merge them together.
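For example, something along these lines (assuming a dataframe df that contains col_to_target_encode among other columns, and a target y; the names are illustrative):

import pandas as pd
import category_encoders as ce

encoder = ce.TargetEncoder(cols=["col_to_target_encode"])
# fit only on the column(s) being encoded, so the expected width stays stable
encoded = encoder.fit_transform(df[["col_to_target_encode"]], y).add_suffix("_te")

# keep the original columns alongside the encoded ones
full = pd.concat([df, encoded], axis=1)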

I'm gonna have to disagree with you on this, though it probably won't matter. First off, we're talking about encoders, not models. Also, in large, complex production scenarios you often don't go from raw data to building models to generating predictions in one simple script. These steps are broken up into multiple phases, the output of each phase is persisted in an ML data / model management framework, and then consumed by different processes at different times.

Having a robust feature-generation process that can pre-generate a diverse set of features, store them, and then pull them together later for multiple different models is important for a mature ML framework.

To your point that the model will fit on all columns, that is easily handled with something like:

# select only the columns the model should be trained on
ml_features = ["stuff", "things", "blah"]
feature_df = big_dataframe_with_many_columns[ml_features]
model = model_algorithm.fit(feature_df, y_train)

which you'd probably do in the scenario where you have a feature store and are pulling together a set of disparate features to build your model.

Again, we're talking about encoders here, not models. There is no need to put the same constraints you would on a model, on an encoder.

Also, if you really want to put good constraints on the dataframe being passed into a model, the number of columns isn't a very strong constraint. You'd probably want to store a dataframe schema, complete with data types, and validate any incoming dataframe against that.
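Something like this sketch, which is not part of category_encoders and whose function names I'm making up for illustration:

import pandas as pd

def capture_schema(df: pd.DataFrame, cols) -> dict:
    """Record the dtypes of the columns the encoder actually needs."""
    return {c: df[c].dtype for c in cols}

def validate_schema(df: pd.DataFrame, schema: dict) -> None:
    """Fail loudly if a required column is missing or has the wrong dtype."""
    missing = [c for c in schema if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    mismatched = {c: (str(dt), str(df[c].dtype))
                  for c, dt in schema.items() if df[c].dtype != dt}
    if mismatched:
        raise ValueError(f"dtype mismatches (expected, got): {mismatched}")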

I see that it might make sense to save some intermediate results in some use cases.
However, the goal of this library is to work as smoothly and as integrated with sklearn as possible. This includes supporting sklearn pipelines (which also many people use in a complex production setting). If an encoder also keeps unencoded features it cannot be used in a pipeline (which also makes hyper-parameter optimization difficult). This is a no-go for us.
Also, sklearn offers some feature engineering / preprocessing transforms itself (like StandardScaler, https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), and those do not have the option of keeping the original / unprocessed data either, so I think we should follow suit. You could again argue that we're only talking about preprocessing and not modelling, but in the end all the preprocessing is followed by a model.
Of course selecting the features manually before fitting the model is not a lot of work, as you point out, but concatenating unprocessed and processed features isn't much more work either when you want to save intermediate results.
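For reference, this is the kind of pipeline usage I mean (the model choice, column name, and the X_train / X_production / y_train variables are placeholders, and the non-encoded columns are assumed to be numeric):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import category_encoders as ce

pipe = Pipeline([
    ("encode", ce.TargetEncoder(cols=["col_to_target_encode"])),
    ("model", LogisticRegression()),
])
# every step receives the full feature frame, so fit and predict are expected
# to see the same columns
pipe.fit(X_train, y_train)
preds = pipe.predict(X_production)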

I kinda feel like you are digging your heels in just to dig your heels in and not make a code change. The number-of-columns check in the encoder is not a valid check for what you are describing. It's just an arbitrary column-count check that has no context of the dataframe schema it's expecting. If you really want to make sure the encoder is acting on a properly structured dataframe, then do that. But don't put in an arbitrary check as a half measure.

I agree that checking the input data could be done better than just checking the number of columns. At least the types should be checked.

If anyone else runs into this, you can work past it by setting the encoder's _dim and its internal ordinal_encoder's _dim to the column count of the dataframe you are using:

# hacky workaround: overwrite the stored column counts so the dimension check passes
encoder.ordinal_encoder._dim = my_df.shape[1]
encoder._dim = my_df.shape[1]

yay python