Giskard-AI/giskard-client

GSK-317: Conflict between `pd.get_dummies` and `_validate_model_execution` in giskard-client

rabah-khalek opened this issue · 0 comments

Related to: Giskard-AI/giskard-examples#1 (comment)

This is a scenario where a custom prediction_function containing manually written transformers (instead of an sklearn pipeline) is not guaranteed to pass Giskard's validation, even if it is functioning correctly.

If prediction_function contains pd.get_dummies(df, columns=one_hot_encoding_columns) (the manual way to perform one-hot encoding), it will most likely fail in _validate_model_execution:

def _validate_model_execution(self, prediction_function, df: pd.DataFrame, model_type,
                              classification_labels=None, target=None) -> None:
    if target is not None and target in df.columns:
        df = df.drop(target, axis=1)
    try:
        prediction_function(df.head(1))
    except Exception:
        raise ValueError("Invalid prediction_function input.\n"
                         "Please make sure that prediction_function(df.head(1)) does not return an error "
                         "message before uploading in Giskard")

The reason is that pd.get_dummies infers the one-hot encoding categories from the rows it is given. If only one row is passed (via head(1)), it cannot see all the categories, so the transformed dataframe ends up with fewer columns than the model expects.

Here's an example (the dataframe has 19 raw columns, and the total number of columns after one-hot encoding should be 40):

pd.get_dummies(test_data.head(1), columns = one_hot_encoding_columns).shape
>> (1, 19)

Increasing the number of rows:

pd.get_dummies(test_data.head(2), columns = one_hot_encoding_columns).shape
>> (2, 25)
pd.get_dummies(test_data.head(10), columns = one_hot_encoding_columns).shape
>> (10, 40)

Possible solutions:

  1. We strongly recommend that users always write their transformations in a way that is independent of the input dataframe, similar to the input normalisation implemented in section 2 of https://github.com/Giskard-AI/examples/blob/task/GSK-308_Churn_prediction_model/Churn_Telco_Kaggle_without_transformers.ipynb (where the min and max of each feature are computed outside the prediction_function); see the sketch after this list.
  2. We increase the number of rows passed in _validate_model_execution.
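
For (1), here is a minimal sketch of a dataframe-independent transform (the data, column names and category lists are hypothetical, not taken from the notebook): the categories are collected once, outside prediction_function, and the columns are cast to a fixed Categorical dtype, so pd.get_dummies produces the same columns no matter how many rows are passed in.

import pandas as pd

# Hypothetical training data standing in for the notebook's dataset
train_data = pd.DataFrame({
    "contract": ["month-to-month", "one-year", "two-year", "month-to-month"],
    "payment_method": ["card", "bank", "card", "cheque"],
    "tenure": [1, 12, 24, 3],
})
one_hot_encoding_columns = ["contract", "payment_method"]

# Collect the full category lists once, outside prediction_function
fixed_categories = {col: sorted(train_data[col].unique()) for col in one_hot_encoding_columns}

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Casting to a fixed Categorical dtype makes get_dummies emit every dummy
    # column, even when the input contains a single row.
    for col, cats in fixed_categories.items():
        df[col] = pd.Categorical(df[col], categories=cats)
    return pd.get_dummies(df, columns=one_hot_encoding_columns)

print(preprocess(train_data.head(1)).shape)  # (1, 7): same columns as the full dataframe
print(preprocess(train_data).shape)          # (4, 7)

A prediction_function built on top of such a preprocess step would pass _validate_model_execution with df.head(1), because the column layout no longer depends on which rows happen to be in the input. Option (2) would only reduce the chance of failure, since no fixed number of rows guarantees that every category appears.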

I'm in favour of (1): I think we should always stress to users that their transformations should be dataframe-independent.