In credit risk data, good loans far outnumber risky ones, so the classes are heavily imbalanced. It is therefore important to employ training and evaluation techniques that help the model learn from the minority class as well as the majority.
In this project, we will be using imbalanced-learn and scikit-learn libraries to build and evaluate models using resampling.
We will evaluate three machine learning models and determine which is the best for predicting credit risk. You can find the code for this part of the project here
The steps involved in this analysis are as follows:
Before we start, import all the dependencies for this project.
STEP 1: Transform the data into a usable form which involves:
- Loading the data
- Dropping NULL values from columns and rows
- Converting strings to numerical datatypes
- Converting target column values to High Risk and Low Risk based on their values
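The cleaning steps above can be sketched as follows. The DataFrame here is a tiny stand-in for the real CSV (which you would load with `pd.read_csv`), and the column names and status labels are hypothetical placeholders:

```python
import pandas as pd

# Toy stand-in for the loan CSV; in the project you would use pd.read_csv(...)
# with the real file. Column names and status values are hypothetical.
df = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000, 4000],
    "int_rate": ["10.5%", "7.2%", None, "15.0%"],
    "unused": [None, None, None, None],  # an all-null column
    "loan_status": ["Current", "Late (31-120 days)", "Current", "Charged Off"],
})

# Drop all-null columns, then rows with remaining nulls
df = df.dropna(axis=1, how="all").dropna()

# Convert percentage strings to numeric dtypes
df["int_rate"] = df["int_rate"].str.rstrip("%").astype(float) / 100

# Binarize the target: "Current" loans become low risk, everything else high risk
df["loan_status"] = df["loan_status"].apply(
    lambda s: "low_risk" if s == "Current" else "high_risk"
)
```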
STEP 2: Split the data into Training and Testing sets
Going a little further, we can
- Check the balance of target values
- Check the shape of the X training set
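A minimal split-and-inspect sketch, using synthetic imbalanced data in place of the cleaned loan DataFrame (in the project, `X` and `y` would come from the prepared data):

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the cleaned loan dataset
X, y = make_classification(
    n_samples=1000, n_classes=2, weights=[0.95, 0.05], random_state=1
)

# Stratify so the rare class appears in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, stratify=y
)

print(Counter(y_train))  # balance of target values
print(X_train.shape)     # shape of the X training set
```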
STEP 3: Oversampling: Here you will compare two oversampling algorithms to determine which performs better.
- Using Naive Random Oversampling
- Using SMOTE Oversampling
STEP 4: Undersampling: Let us use the Cluster Centroids algorithm here.
STEP 5: Over and Under Sampling (SMOTEENN)
Here, we will use imblearn.ensemble's BalancedRandomForestClassifier and EasyEnsembleClassifier to predict credit risk and evaluate each model.
You can find the code for this part of the project here
Before we start, ensure you have installed all the necessary libraries. If not, do a quick pip install imbalanced-learn and pip install -U scikit-learn. Bring in all the dependencies as well.
STEP 1: Much like before, bring in the CSV and clean it up so it can be used for risk analysis and testing.
STEP 2: Split the data into Training and Testing sets
STEP 3: Ensemble Learners: Here, you will train a Balanced Random Forest Classifier and an Easy Ensemble AdaBoost classifier to see which one gives better results.
- Balanced Random Forest Classifier:
- List the features sorted in descending order by feature importance
- Easy Ensemble AdaBoost Classifier
Let us compare the various results:
Naive Oversampling Results
SMOTE Results
Cluster Centroid Results
SMOTEENN Results
Easy Ensemble Results
From these results, we notice very low precision for the High_Risk class. This indicates a large number of false positives: many loans flagged as high risk are actually low risk.
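To see why low precision means many false positives, recall the definition precision = TP / (TP + FP). A quick arithmetic check with hypothetical counts for the high-risk class:

```python
# Hypothetical confusion-matrix counts for the high-risk class
tp = 30    # high-risk loans correctly flagged
fp = 970   # low-risk loans incorrectly flagged as high risk

# Precision = TP / (TP + FP); false positives dominate the denominator
precision = tp / (tp + fp)
print(round(precision, 3))  # 0.03
```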