PDP plot: use train or test set in pdp_isolate()?
FernandoDoreto opened this issue · 4 comments
Hi @SauceCat, quick conceptual question:
Say I selected a given feature to analyze in my multiclass classifier.
In the pdp.pdp_isolate() function for the PDP plot, when would it make sense to use the train set versus the test set for the dataset parameter?
Initially, I'd say it is more complete to build two PDP plots for the same feature, one using the train set and another using the test set, so you can verify whether that feature has an equivalent impact on both sets. But I am interested in your thoughts.
Regards, Fernando
training set
Hello, you bring up an excellent point regarding the calculation of partial dependence plots (PDPs). Indeed, PDPs can be calculated on any dataset, whether training or test.
Personally, I lean towards using the test set for this purpose, mainly because the model may overfit to the training data. However, using the test set brings its own set of challenges, mainly that the test set usually isn't as large as the training set. Consequently, it might not be fully representative and may not be ideal for calculating partial dependence values, given that the sample size might not be large enough to average out side effects.
Therefore, a balanced approach could be to compute the PDPs on both training and test sets. This way, you can compare and spot any surprising discrepancies. If the impact of the features differs significantly between the training and test sets, it could suggest a mismatch in the distributions of your training and test data. Alternatively, it could imply overfitting of your model to the training data.
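For instance, a minimal sketch of that comparison, assuming the function-based pdpbox API (pdp.pdp_isolate / pdp.pdp_plot) from the original question; the data, model, and column names below are toy placeholders:

```python
# Sketch: compute the PDP for the same feature on both the train and test
# sets so the two curves can be compared side by side.
import pandas as pd
from pdpbox import pdp
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=5, n_informative=3, random_state=0)
feature_cols = [f"f{i}" for i in range(5)]
df = pd.DataFrame(X, columns=feature_cols)
X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

for name, subset in [("train", X_train), ("test", X_test)]:
    result = pdp.pdp_isolate(
        model=model,
        dataset=subset,               # the data the predictions are averaged over
        model_features=feature_cols,  # every column the model expects as input
        feature="f0",                 # the single feature being varied
    )
    pdp.pdp_plot(result, feature_name=f"f0 ({name})")
```

If the two curves diverge noticeably, that is the kind of train/test mismatch or overfitting signal described above.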
As an added measure, you could also employ k-fold cross-validation. This allows you to check for fluctuations in the PDPs across different folds and provides further insight into the robustness of your model's responses to the features. This way, you can make a more informed decision on the reliability of the PDPs.
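A rough sketch of that k-fold check, using a hand-rolled partial dependence helper (grid substitution plus averaged predictions) rather than any particular PDPbox call; the data, model, and helper names are all illustrative:

```python
# Sketch: refit the model on each fold, compute a partial dependence curve on
# the held-out fold, and look at how much the curves fluctuate across folds.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=2000, n_features=5, n_informative=3, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
grid = np.linspace(df["f0"].min(), df["f0"].max(), 20)  # values at which f0 is evaluated

def partial_dependence(model, data, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        tmp = data.copy()
        tmp[feature] = value
        curve.append(model.predict_proba(tmp)[:, 1].mean())
    return np.array(curve)

curves = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    model = RandomForestClassifier(random_state=0).fit(df.iloc[train_idx], y[train_idx])
    curves.append(partial_dependence(model, df.iloc[valid_idx], "f0", grid))

# The spread across folds at each grid point hints at how stable the PDP is.
print(np.vstack(curves).std(axis=0))
```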
Now that we're in 2023, I've also gathered some insights from ChatGPT:
Partial Dependence Plots (PDPs) are used to visualize the marginal effect one or two features have on the predicted outcome of a machine learning model. They can be computed and visualized using either training set or test set data.
However, it's more common and generally recommended to use the test set data for this purpose. The reason is that PDPs show how a model makes predictions for different feature values, and this is more meaningful and insightful when evaluated on unseen data (i.e., the test set). Evaluating on the test set can provide a better understanding of how the model generalizes to new data.
Remember that while PDPs are a powerful tool, they also make certain assumptions (like independence between the features) and might not capture complex interactions in some cases, particularly for highly correlated features or non-monotonic relationships. As always with such analysis tools, use them as part of a broader toolkit and consider the specific context and characteristics of your data and model.
Hi to all. I'm adding this as a comment because it's related to PDPIsolate. (Apologies if I posted it in the wrong place.)
There is a confusing part in the parameter list of PDPIsolate:
model, df, model_features, feature, feature_name
model_features, feature, and feature_name? What should I provide for each?
Documentation
The documentation doesn't explain all the parameters, so it's kind of confusing.
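For what it's worth, here is a sketch of how that parameter list is commonly filled in, assuming the class-based PDPIsolate API quoted above; the exact keyword set and defaults vary between PDPbox releases, and the data, model, and column names below are placeholders:

```python
# Sketch: what each listed PDPIsolate parameter is meant to receive.
import pandas as pd
from pdpbox import pdp
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=4, n_informative=3, random_state=0)
feature_cols = ["age", "income", "tenure", "score"]  # illustrative column names
df = pd.DataFrame(X, columns=feature_cols)

model = RandomForestClassifier(random_state=0).fit(df, y)

iso = pdp.PDPIsolate(
    model=model,                  # the fitted estimator itself
    df=df,                        # the data the PDP is computed over (train or test, per the discussion above)
    model_features=feature_cols,  # all columns the model was trained on
    feature="age",                # the single column you want to isolate
    feature_name="Age",           # display name used on the plot
)
```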