Giskard-AI/giskard-client

GSK-317: Conflict between `pd.get_dummies` and `_validate_model_execution` in giskard-client

rabah-khalek opened this issue · 0 comments

Related to: Giskard-AI/giskard-examples#1 (comment)

This is a scenario where a custom prediction_function containing manually written transformers (instead of an sklearn pipeline) is not guaranteed to pass Giskard's validation, even if it is functioning correctly.

If prediction_function contains pd.get_dummies(df, columns=one_hot_encoding_columns) (the manual way to perform one-hot encoding), it will most likely fail in _validate_model_execution:

def _validate_model_execution(self, prediction_function, df: pd.DataFrame, model_type,
                              classification_labels=None, target=None) -> None:
    if target is not None and target in df.columns:
        df = df.drop(target, axis=1)
    try:
        prediction_function(df.head(1))
    except Exception:
        raise ValueError("Invalid prediction_function input.\n"
                         "Please make sure that prediction_function(df.head(1)) does not return an error "
                         "message before uploading in Giskard")

The reason is that pd.get_dummies infers the one-hot encoding categories from the rows it is given. If only one row is passed (via head(1)), it cannot see all the categories, so the transformed dataframe ends up with fewer columns than the model expects.

Here's an example (the dataframe has 19 raw columns, and the total number of columns after one-hot encoding should be 40):

pd.get_dummies(test_data.head(1), columns = one_hot_encoding_columns).shape
>> (1, 19)

Increasing the number of rows:

pd.get_dummies(test_data.head(2), columns = one_hot_encoding_columns).shape
>> (2, 25)
pd.get_dummies(test_data.head(10), columns = one_hot_encoding_columns).shape
>> (10, 40)

Possible solutions:

  1. We strongly recommend that users always write their transformations in a way that is independent of the input dataframe, similar to the input normalisation implemented in section 2 of https://github.com/Giskard-AI/examples/blob/task/GSK-308_Churn_prediction_model/Churn_Telco_Kaggle_without_transformers.ipynb (where the min and max of each feature are computed outside the prediction_function); see the sketch after this list.
  2. We increase the number of rows passed in _validate_model_execution.
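
For (1), here is a minimal sketch of a dataframe-independent transform (the data, column names and category lists are hypothetical, not taken from the notebook): the categories are collected once, outside prediction_function, and the columns are cast to a fixed Categorical dtype, so pd.get_dummies produces the same columns no matter how many rows are passed in.

import pandas as pd

# Hypothetical training data standing in for the notebook's dataset
train_data = pd.DataFrame({
    "contract": ["month-to-month", "one-year", "two-year", "month-to-month"],
    "payment_method": ["card", "bank", "card", "cheque"],
    "tenure": [1, 12, 24, 3],
})
one_hot_encoding_columns = ["contract", "payment_method"]

# Collect the full category lists once, outside prediction_function
fixed_categories = {col: sorted(train_data[col].unique()) for col in one_hot_encoding_columns}

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Casting to a fixed Categorical dtype makes get_dummies emit every dummy
    # column, even when the input contains a single row.
    for col, cats in fixed_categories.items():
        df[col] = pd.Categorical(df[col], categories=cats)
    return pd.get_dummies(df, columns=one_hot_encoding_columns)

print(preprocess(train_data.head(1)).shape)  # (1, 7): same columns as the full dataframe
print(preprocess(train_data).shape)          # (4, 7)

A prediction_function built on top of such a preprocess step would pass _validate_model_execution with df.head(1), because the column layout no longer depends on which rows happen to be in the input. Option (2) would only reduce the chance of failure, since no fixed number of rows guarantees that every category appears.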

I'm in favour of (1): I think we should always stress to users that their transformations should be dataframe-independent.