Feature Importance and Vulnerability Analysis in ML Models

Project Overview

This repository provides a robust framework for conducting feature importance and vulnerability analysis in machine learning models, specifically designed for tabular data. The framework addresses two primary business problems:

- Improving the prediction of marketing campaign success for term deposits using the bank marketing dataset
- Enhancing the assessment of credit risk using the German credit risk dataset

Objectives

Implement a robust framework for analyzing and improving machine learning models.
Conduct uncertainty, feature importance, and feature performance analyses to identify model weaknesses and vulnerabilities.
Enhance the performance of marketing and credit risk models by improving their feature selection and understanding the impact of individual features on model predictions.

Baselines and Metrics

Marketing Model:
- Marketing Baseline
- Provides insights and predictions to optimize campaign strategies, targeting the most likely customers for term deposits.
- Baseline ROC-AUC: 0.89
- Our improvement: Increase ROC-AUC by 0.02 (2 percentage points), from 0.89 to 0.91
Credit Risk Model:
- Credit Risk Baseline
- Offers more accurate risk assessments to inform lending decisions and reduce default rates.
- Baseline ROC-AUC: 0.78
- Our improvement: Increase ROC-AUC by 0.02 (2 percentage points), from 0.78 to 0.80
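
For readers less familiar with the metric, ROC-AUC can be read as the fraction of (positive, negative) pairs that the model ranks in the right order. The sketch below is a minimal pure-Python illustration of that definition and of the arithmetic behind the marketing figures above; the function name `roc_auc` and the toy labels and scores are assumptions made for this example, not code or data from the repository.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy labels and scores (illustrative only, not taken from either dataset).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 3 of the 4 pairs are ordered correctly

# Arithmetic behind the marketing figures: absolute versus relative gain.
baseline, improved = 0.89, 0.91
absolute_gain = improved - baseline       # about 0.02
relative_gain = absolute_gain / baseline  # about 0.0225, i.e. roughly 2.2%
```

The pairwise loop is quadratic in the number of samples, so it is only meant to make the definition concrete; a library routine such as scikit-learn's `roc_auc_score` is the usual choice on real data.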

Analysis Execution

Iterate over the analyses_to_run list.
Check if each analysis is defined in the analysis_methods dictionary.
For each valid analysis, iterate over the trained_pipelines.
Execute the specified analysis method on each pipeline.
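
The four steps above can be sketched as a small dispatch loop. This is a minimal illustration rather than the repository's implementation: it assumes `trained_pipelines` is a dictionary mapping a name to a fitted pipeline and that every entry in `analysis_methods` is a callable taking `(pipeline_name, pipeline)`. The function name `run_analyses` and the placeholder callables are hypothetical.

```python
def run_analyses(analyses_to_run, analysis_methods, trained_pipelines):
    """Run each requested analysis on every trained pipeline.

    Names missing from analysis_methods are skipped and returned separately.
    """
    results = {}
    skipped = []
    for analysis_name in analyses_to_run:
        method = analysis_methods.get(analysis_name)
        if method is None:
            skipped.append(analysis_name)
            continue
        for pipeline_name, pipeline in trained_pipelines.items():
            results[(analysis_name, pipeline_name)] = method(pipeline_name, pipeline)
    return results, skipped


# Placeholder callables and pipelines standing in for the real ones (assumed).
analysis_methods = {
    "uncertainty": lambda name, pipe: f"uncertainty on {name}",
    "feature_importance": lambda name, pipe: f"feature importance on {name}",
}
trained_pipelines = {"marketing": object(), "credit_risk": object()}

results, skipped = run_analyses(
    ["uncertainty", "not_defined"], analysis_methods, trained_pipelines
)
```

Collecting unknown names in `skipped` instead of raising keeps one misspelled entry from aborting the remaining analyses; whether the framework itself skips or raises is not stated here.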

Analysis Methods

Three specific types of analyses are defined in the ModelImprover class:

Uncertainty Analysis
- Purpose: Understand the confidence level of the model's predictions.
- Method: Uses a baseline ensemble Monte Carlo method to calculate the uncertainty of the model's predictions.
Feature Importance Analysis
- Purpose: Determine the importance of different features used by the model.
- Method: Uses SHAP values to plot feature importance and SHAP summary plots, and selects features based on their SHAP values.
Feature Performance Analysis
- Purpose: Analyze the performance of individual features in contributing to the model's predictions.
- Method: Assesses how changes in feature values affect model accuracy or other performance metrics, identifying weaknesses in the model's use of certain features.
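
One common way to measure how changes in a feature affect a metric is to shuffle that feature's column and record the resulting drop in accuracy. The sketch below assumes this permutation approach purely for illustration; the repository's feature performance analysis may perturb features differently. The toy data, the stand-in `model`, and the helper names are all hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    hits = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return hits / len(rows)


def metric_drops(model, rows, labels, n_features, n_repeats=20, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    base = accuracy(model, rows, labels)
    drops = []
    for j in range(n_features):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            # Rebuild every row with only feature j replaced by a shuffled value.
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            total += base - accuracy(model, shuffled, labels)
        drops.append(total / n_repeats)
    return base, drops


# Toy setup: the label depends only on feature 0; feature 1 is pure noise.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [1 if row[0] > 0.5 else 0 for row in rows]


def model(row):
    """Stand-in for a trained pipeline: thresholds feature 0 only."""
    return 1 if row[0] > 0.5 else 0


base, drops = metric_drops(model, rows, labels, n_features=2)
```

In this toy case shuffling feature 0 costs a large share of the accuracy, while shuffling feature 1 costs nothing because the stand-in model never reads it. A feature with a large drop is one the model leans on heavily, which is also where noisy or shifted inputs would hurt most.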

Utility Classes and Methods

Uncertainty
- Methods:
  - baseline_ensemble_monte_carlo: Calculates the uncertainty of the model's predictions using ensemble Monte Carlo simulations.
Explainability
- Methods:
  - plot_feature_importance: Plots the importance of each feature.
  - plot_shap_summary: Creates a SHAP summary plot.
  - select_features_based_on_shap: Selects features based on their SHAP values.
FeaturePerformanceWeaknessAnalyzer
- Methods:
  - analyze_feature_performance: Analyzes the performance of individual features.
  - plot_metric_drops: Plots the performance drops for vulnerable features.
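
To make the idea behind the uncertainty utility concrete, the sketch below estimates predictive uncertainty from an ensemble whose members are fitted on random bootstrap resamples, reporting the mean and spread of their predicted probabilities. This is an assumption about what an ensemble Monte Carlo estimate can look like, not the actual body of `baseline_ensemble_monte_carlo`. The one-dimensional centroid model, the function names, and the synthetic data are hypothetical.

```python
import math
import random
import statistics


def fit_centroid_model(xs, ys):
    """Fit a 1-D two-centroid scorer and return a function x -> P(y = 1)."""
    mu0 = statistics.fmean([x for x, y in zip(xs, ys) if y == 0])
    mu1 = statistics.fmean([x for x, y in zip(xs, ys) if y == 1])
    midpoint = (mu0 + mu1) / 2.0
    slope = 4.0 / (mu1 - mu0)  # assumed steepness: logit is +/-2 at the centroids
    return lambda x: 1.0 / (1.0 + math.exp(-slope * (x - midpoint)))


def ensemble_monte_carlo(xs, ys, query_points, n_members=50, seed=0):
    """Mean and standard deviation of P(y = 1) across bootstrap-trained members."""
    rng = random.Random(seed)
    n = len(xs)
    member_preds = []
    while len(member_preds) < n_members:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        sample_y = [ys[i] for i in idx]
        if len(set(sample_y)) < 2:
            continue  # a resample needs both classes to fit the centroids
        member = fit_centroid_model([xs[i] for i in idx], sample_y)
        member_preds.append([member(q) for q in query_points])
    means, stds = [], []
    for per_point in zip(*member_preds):
        means.append(statistics.fmean(per_point))
        stds.append(statistics.pstdev(per_point))
    return means, stds


# Synthetic 1-D data: class 0 centred at 0.0, class 1 centred at 2.0.
data_rng = random.Random(1)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(50)]
xs += [data_rng.gauss(2.0, 1.0) for _ in range(50)]
ys = [0] * 50 + [1] * 50

means, stds = ensemble_monte_carlo(xs, ys, query_points=[-2.0, 1.0, 4.0])
```

The spread is largest for the query point near the decision boundary (1.0) and small for points far from it, which is the pattern an uncertainty analysis looks for when flagging low-confidence predictions.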

Getting Started

Prerequisites

Installation

Clone the repository:

git clone https://github.com/zahere/MLOps-24.git
cd MLOps-24

Install the dependencies:
```
pip install -r requirements.txt
```

Usage

Prepare your datasets and add the configuration JSON file to the /config folder.
Define your pipelines and train your models.
Configure the analyses_to_run list and the analysis_methods dictionary.
Execute the analyses using the ModelImprover class.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Special thanks to Dr. Ishai Rosenberg - MLOps 24 Course (Y-Data)
