This repository provides a robust framework for conducting feature importance and vulnerability analysis in machine learning models, specifically designed for tabular data. The framework addresses two primary business problems:
- Improving the prediction of marketing campaign success for term deposits using the bank marketing dataset
- Enhancing the assessment of credit risk using the German credit risk dataset
The project's objectives are to:
- Implement a robust framework for analyzing and improving machine learning models.
- Conduct uncertainty, feature importance, and feature performance analyses to identify model weaknesses and vulnerabilities.
- Enhance the performance of the marketing and credit risk models by improving feature selection and clarifying how individual features affect model predictions.
- Marketing Model:
  - Marketing Baseline
  - Provides insights and predictions to optimize campaign strategies, targeting the customers most likely to subscribe to term deposits.
  - Baseline ROC-AUC: 0.89
  - Our improvement: Increase ROC-AUC by 0.02 (2 percentage points), to 0.91
- Credit Risk Model:
  - Credit Risk Baseline
  - Offers more accurate risk assessments to inform lending decisions and reduce default rates.
  - Baseline ROC-AUC: 0.78
  - Our improvement: Increase ROC-AUC by 0.02 (2 percentage points), to 0.80
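ROC-AUC, the metric reported for both models, can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pair counting on toy data. The labels and scores are invented for illustration and do not come from either dataset, and the repository may well compute the metric with a library function instead.

```python
from itertools import product


def roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(n_pos * n_neg), which is
    fine for a small demonstration.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy labels and scores (illustrative only, not from either dataset).
y_true = [0, 0, 1, 1]
baseline_scores = [0.1, 0.4, 0.35, 0.8]  # one negative outranks one positive
improved_scores = [0.1, 0.3, 0.35, 0.8]  # every positive outranks every negative

baseline_auc = roc_auc(y_true, baseline_scores)
improved_auc = roc_auc(y_true, improved_scores)
print(f"baseline={baseline_auc:.2f} improved={improved_auc:.2f} "
      f"absolute gain={improved_auc - baseline_auc:.2f}")
```

Reporting the gain as an absolute difference between two ROC-AUC values avoids ambiguity between percentage points and relative percent.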
The analyses are executed as follows:
- Iterate over the analyses_to_run list.
- Check if each analysis is defined in the analysis_methods dictionary.
- For each valid analysis, iterate over the trained_pipelines.
- Execute the specified analysis method on each pipeline.
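The steps above can be sketched as a small dispatch loop. This is a minimal illustration, not the repository's code: the function `run_analyses`, the two stand-in analysis functions, and the assumption that `trained_pipelines` is a dictionary mapping names to pipelines are all hypothetical.

```python
def run_analyses(analyses_to_run, analysis_methods, trained_pipelines):
    """Run each requested analysis on each trained pipeline.

    Analysis names missing from `analysis_methods` are skipped and reported
    back instead of raising an error.
    """
    results = {}
    skipped = []
    for analysis_name in analyses_to_run:
        method = analysis_methods.get(analysis_name)
        if method is None:
            skipped.append(analysis_name)
            continue
        for pipeline_name, pipeline in trained_pipelines.items():
            results[(analysis_name, pipeline_name)] = method(pipeline)
    return results, skipped


# Hypothetical stand-ins for real analysis methods.
def uncertainty_analysis(pipeline):
    return f"uncertainty report for {pipeline}"


def feature_importance_analysis(pipeline):
    return f"feature importance report for {pipeline}"


analysis_methods = {
    "uncertainty": uncertainty_analysis,
    "feature_importance": feature_importance_analysis,
}
analyses_to_run = ["uncertainty", "feature_importance", "not_defined"]

# Assumed shape: name -> trained pipeline (plain strings stand in for models).
trained_pipelines = {
    "marketing": "marketing_pipeline",
    "credit_risk": "credit_risk_pipeline",
}

results, skipped = run_analyses(analyses_to_run, analysis_methods, trained_pipelines)
print(sorted(results))
print("skipped:", skipped)
```

Keeping the name-to-method mapping in a dictionary means a new analysis can be enabled by adding one entry, without touching the loop.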
Three specific types of analyses are defined in the ModelImprover class:
- Uncertainty Analysis
  - Purpose: Understand the confidence level of the model's predictions.
  - Method: Uses a baseline ensemble Monte Carlo method to calculate the uncertainty of the model's predictions.
- Feature Importance Analysis
  - Purpose: Determine the importance of different features used by the model.
  - Method: Uses SHAP values to plot feature importance and SHAP summary plots, and selects features based on their SHAP values.
- Feature Performance Analysis
  - Purpose: Analyze the performance of individual features in contributing to the model's predictions.
  - Method: Assesses how changes in feature values affect model accuracy or other performance metrics, identifying weaknesses in the model's use of certain features.
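To make the idea behind the uncertainty analysis concrete, here is a minimal sketch of one common ensemble approach: fit several models on bootstrap resamples and use the spread of their predictions as the uncertainty estimate. The text here does not specify how the repository's ensemble Monte Carlo method works internally, so the toy one-dimensional "model", the function names, and the data are all assumptions made for illustration.

```python
import math
import random
import statistics


def fit_member(class0, class1, rng):
    """Fit one ensemble member on a stratified bootstrap resample.

    The toy "model" is just the midpoint between the resampled class means.
    """
    boot0 = rng.choices(class0, k=len(class0))
    boot1 = rng.choices(class1, k=len(class1))
    return (statistics.fmean(boot0) + statistics.fmean(boot1)) / 2


def predict_proba(midpoint, x, steepness=2.0):
    """Probability of class 1: a logistic curve centred on the midpoint."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))


def ensemble_uncertainty(class0, class1, x, n_members=50, seed=0):
    """Return (mean probability, standard deviation) across ensemble members."""
    rng = random.Random(seed)
    probs = [
        predict_proba(fit_member(class0, class1, rng), x)
        for _ in range(n_members)
    ]
    return statistics.fmean(probs), statistics.pstdev(probs)


# Toy one-dimensional feature values per class (illustrative only).
class0 = [0.0, 0.4, 0.9, 1.3, 1.8, 2.2]
class1 = [2.0, 2.6, 3.1, 3.5, 3.9, 4.4]

mean_near, std_near = ensemble_uncertainty(class0, class1, x=2.2)  # near the boundary
mean_far, std_far = ensemble_uncertainty(class0, class1, x=6.0)    # far from it
print(f"x=2.2: p={mean_near:.3f} +/- {std_near:.3f}")
print(f"x=6.0: p={mean_far:.3f} +/- {std_far:.3f}")
```

Inputs close to the decision boundary get a wider spread across members than inputs far from it, which is the signal an uncertainty analysis is looking for.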
These analyses rely on the following components:
- Uncertainty
  - Methods:
    - baseline_ensemble_monte_carlo: Calculates the uncertainty of the model's predictions using ensemble Monte Carlo simulations.
- Explainability
  - Methods:
    - plot_feature_importance: Plots the importance of each feature.
    - plot_shap_summary: Creates a SHAP summary plot.
    - select_features_based_on_shap: Selects features based on their SHAP values.
- FeaturePerformanceWeaknessAnalyzer
  - Methods:
    - analyze_feature_performance: Analyzes the performance of individual features.
    - plot_metric_drops: Plots the performance drops for vulnerable features.
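As an illustration of the kind of measurement a feature performance analysis produces, the sketch below shuffles one feature at a time and records how much accuracy drops. Shuffling is only one possible perturbation and is an assumption here, not necessarily what `analyze_feature_performance` does. The `metric_drops` helper, the toy rule-based model, and the feature names `balance` and `noise` are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    correct = sum(model(row) == label for row, label in zip(rows, labels))
    return correct / len(labels)


def metric_drops(model, rows, labels, feature_names, seed=0):
    """Shuffle one feature at a time and record the resulting accuracy drop."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = {}
    for name in feature_names:
        shuffled = [row[name] for row in rows]
        rng.shuffle(shuffled)
        # Copy each row with only this feature replaced by a shuffled value.
        perturbed = [{**row, name: value} for row, value in zip(rows, shuffled)]
        drops[name] = baseline - accuracy(model, perturbed, labels)
    return baseline, drops


def toy_model(row):
    """Stand-in for a trained pipeline: uses `balance`, ignores `noise`."""
    return 1 if row["balance"] > 50 else 0


# Toy data (illustrative only): labels follow the same rule as the model.
rows = [{"balance": i * 5, "noise": i % 3} for i in range(20)]
labels = [1 if row["balance"] > 50 else 0 for row in rows]

baseline, drops = metric_drops(toy_model, rows, labels, ["balance", "noise"])
print(f"baseline accuracy: {baseline:.2f}")
for name, drop in sorted(drops.items(), key=lambda item: -item[1]):
    print(f"{name}: accuracy drop {drop:.2f}")
```

A feature the model ignores shows no drop at all, while a feature it depends on heavily shows a large one; plotting these drops gives a view comparable to what the description of plot_metric_drops suggests.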
- Clone the repository:

      git clone https://github.com/zahere/MLOps-24.git
      cd MLOps-24

- Install the dependencies:

      pip install -r requirements.txt

- Prepare your datasets and add the configuration JSON to the /config folder.
- Define your pipelines and train your models.
- Configure the analyses_to_run list and the analysis_methods dictionary.
- Execute the analyses using the ModelImprover class.
This project is licensed under the MIT License - see the LICENSE file for details.
- Special thanks to Dr. Ishai Rosenberg - MLOps 24 Course (Y-Data)