[BUG] Feature importance not evaluated on test set when setting train_or_test = 'test'
Describe the bug
I don't know if this is the intended behaviour or not, but when setting the train_or_test
parameter to 'test', the data is first split into train/test sets and the model is then fit on the training set, cf. lines 302-310:
if self.train_or_test.lower() == 'test':
    # keeping the same naming convention as to not add complexity later on
    self.X_boruta_train, self.X_boruta_test, self.y_train, self.y_test, self.w_train, self.w_test = train_test_split(
        self.X_boruta,
        self.y,
        self.sample_weight,
        test_size=0.3,
        random_state=self.random_state,
        stratify=self.stratify)
    self.Train_model(self.X_boruta_train, self.y_train, sample_weight=self.w_train)
However, X_boruta_test is not used anywhere else; in fact, the whole dataset X is used to derive feature importance regardless of the chosen train_or_test, cf. lines 856 and 873 for importance_measure == 'shap':
self.shap_values = np.array(explainer.shap_values(self.X_boruta))
self.shap_values = explainer.shap_values(self.X_boruta)
and line 815 for importance_measure == 'perm':
perm_importances_ = permutation_importance(self.model, self.X, self.y, scoring='f1')
While for SHAP this may not constitute a big difference (according to this post), it does not correspond to what is recommended here for permutation feature importance, namely computing the importances on held-out data; a sketch of the pattern I would have expected is below.
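For illustration, here is a minimal, self-contained sketch of that pattern (not the library's actual code; make_classification, RandomForestClassifier, and the 70/30 split are just assumptions to keep the example runnable):

# illustrative sketch, not the library's code: fit on the training split,
# then evaluate importances on the held-out split
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# SHAP values evaluated on the hold-out rather than on all of X
explainer = shap.TreeExplainer(model)
shap_values = np.array(explainer.shap_values(X_test))

# permutation importance likewise computed on the hold-out,
# as recommended for an unbiased estimate
perm_importances_ = permutation_importance(model, X_test, y_test, scoring='f1')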
Granted, X and X_train are not exactly the same, but they still share 70% of the samples, so I'm wondering whether this is the intended behaviour. Could anyone provide some guidance on this?
Thank you for your help.