GSK-317: Conflict between `pd.get_dummies` and `_validate_model_execution` in giskard-client
rabah-khalek opened this issue · 0 comments
Related to: Giskard-AI/giskard-examples#1 (comment)
This is a scenario in which a custom `prediction_function` that contains manually written transformers (instead of an sklearn pipeline) is not guaranteed to pass Giskard's test, even if it functions correctly.
If `prediction_function` contains `pd.get_dummies(df, columns = one_hot_encoding_columns)` (the manual way to perform one-hot encoding), it will most likely fail in:
giskard-client/giskard/client/project.py
Lines 455 to 464 in d20c0a1
The reason is that `pd.get_dummies` infers the one-hot encoding categories from the rows it is given. If only one row is passed (via `head(1)`), it cannot see all the categories, so the encoded dataframe is missing columns.
Here's an example (where the total number of categories after transformation should be 40 and the raw categories are 19):
pd.get_dummies(test_data.head(1), columns = one_hot_encoding_columns).shape
>> (1, 19)
Increasing the number of rows:
pd.get_dummies(test_data.head(2), columns = one_hot_encoding_columns).shape
>> (2, 25)
pd.get_dummies(test_data.head(10), columns = one_hot_encoding_columns).shape
>> (10, 40)
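The same effect can be reproduced on a small, self-contained dataframe. The sketch below uses hypothetical toy data (the column names and values are illustrative assumptions, not the dataset from this issue) and only shows how the encoded width grows with the number of rows passed in:

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
test_data = pd.DataFrame({
    "contract": ["monthly", "yearly", "monthly", "two-year"],
    "payment": ["card", "card", "cheque", "transfer"],
    "tenure": [1, 24, 3, 48],
})
one_hot_encoding_columns = ["contract", "payment"]


def encoded_width(n_rows):
    """Number of columns produced by get_dummies on the first n_rows rows."""
    encoded = pd.get_dummies(test_data.head(n_rows), columns=one_hot_encoding_columns)
    return encoded.shape[1]


# The width depends on which categories happen to appear in the slice.
widths = [encoded_width(n) for n in (1, 2, 4)]
print(widths)  # [3, 4, 7]
```

With one row, each encoded column contributes a single dummy column, so a model trained on the full 7-column encoding would receive an input of the wrong width.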
Possible solutions:
- Either we strongly recommend that users always write their transforms so that they are independent of the input dataframe, similar to the input normalisation implemented in section 2 of https://github.com/Giskard-AI/examples/blob/task/GSK-308_Churn_prediction_model/Churn_Telco_Kaggle_without_transformers.ipynb (where the min and max of the features are gathered outside the `prediction_function`),
- or we increase the number of rows passed to `_validate_model_execution`.

I'm in favour of the first option (1.): I think we should always stress to users that their transforms should be dataframe-independent.
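As a rough illustration of what a dataframe-independent one-hot encoding could look like, here is a minimal sketch. It assumes the categories are collected once from the training data, outside the prediction function, and pinned with `pd.CategoricalDtype`; the toy data and the helper name `one_hot_encode` are hypothetical and not part of giskard-client:

```python
import pandas as pd

# Hypothetical training data; column names and values are illustrative only.
train_data = pd.DataFrame({
    "contract": ["monthly", "yearly", "monthly", "two-year"],
    "payment": ["card", "card", "cheque", "transfer"],
    "tenure": [1, 24, 3, 48],
})
one_hot_encoding_columns = ["contract", "payment"]

# Gathered once, outside the prediction function, so the encoding no longer
# depends on which rows are passed in later.
category_dtypes = {
    col: pd.CategoricalDtype(categories=sorted(train_data[col].dropna().unique()))
    for col in one_hot_encoding_columns
}


def one_hot_encode(df):
    """Hypothetical helper: one-hot encode with a fixed set of categories."""
    typed = df.astype(category_dtypes)
    # get_dummies emits one column per declared category, observed or not.
    return pd.get_dummies(typed, columns=one_hot_encoding_columns)


single_row = one_hot_encode(train_data.head(1))
all_rows = one_hot_encode(train_data)
print(single_row.shape, all_rows.shape)  # (1, 7) (4, 7)
```

Because the categories are fixed in advance, a single-row input produces the same columns as the full dataset. Values not seen when the categories were collected become missing and are encoded as all zeros, which may or may not be the desired behaviour.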