The purpose of this analysis is to use the credit card dataset from LendingClub, a peer-to-peer lending services company, to compare machine learning models that reduce bias and predict credit risk accurately. To achieve this, the data is oversampled (Naive Random Oversampling, SMOTE), undersampled (ClusterCentroids), and resampled with a combined approach (SMOTEENN). Two ensemble classifiers, Balanced Random Forest and Easy Ensemble AdaBoost, are also evaluated. The imbalanced-learn and scikit-learn libraries are used to build and evaluate the models.
The results for the Naive Random Oversampling method are found below:
- The Balanced Accuracy Score for the Naive Random Oversampling method is 64.66%
- The Precision High Risk is 1%
- The Precision Low Risk is 100%
- The Recall High Risk is 74%
- The Recall Low Risk is 55%
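To show what naive random oversampling does to the training data, here is a small standard-library sketch. The function `random_oversample` and the toy rows are hypothetical and are not part of the analysis. In practice imbalanced-learn's `RandomOverSampler` would normally be used instead.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Copy randomly chosen rows of each smaller class until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Draw with replacement, so one original row may be copied many times.
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data (hypothetical): 8 low-risk rows and 2 high-risk rows.
rows = [[i, i * 2] for i in range(10)]
labels = ["low_risk"] * 8 + ["high_risk"] * 2

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Because the added rows are exact duplicates, the classifier sees no new information about high risk loans. This is one plausible reason the high-risk precision stays low even though recall improves.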
The results for the SMOTE Oversampling method are described below:
- The Balanced Accuracy Score for the SMOTE Oversampling method is 66.24%
- The Precision High Risk is 1%
- The Precision Low Risk is 100%
- The Recall High Risk is 63%
- The Recall Low Risk is 69%
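Unlike naive oversampling, SMOTE creates new minority rows by interpolating between an existing minority row and one of its nearest minority neighbours. The sketch below illustrates only that interpolation step with the standard library. The function `smote_sketch`, its parameters, and the toy points are hypothetical, and imbalanced-learn's `SMOTE` should be preferred for real work.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Brute-force search for the k nearest minority neighbours of the base point.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # position along the segment between the two points
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy high-risk rows with two features (hypothetical values).
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]]
new_points = smote_sketch(minority, n_new=5)
for point in new_points:
    print([round(v, 3) for v in point])
```

Every synthetic point lies on a line segment between two real minority points, so the new rows stay inside the region already occupied by the minority class.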
The results for the ClusterCentroids Undersampling method are as follows:
- The Balanced Accuracy Score is 54.42%
- The Precision High Risk is 1%
- The Precision Low Risk is 100%
- The Recall High Risk is 67%
- The Recall Low Risk is 42%
The results for the SMOTEENN Sampling method are as follows:
- The Balanced Accuracy Score is 64.00%
- The Precision High Risk is 1%
- The Precision Low Risk is 100%
- The Recall High Risk is 70%
- The Recall Low Risk is 58%
Here are the results for the Balanced Random Forest Classifier method:
- The Balanced Accuracy Score is 78.85%
- The Precision High Risk is 3%
- The Precision Low Risk is 100%
- The Recall High Risk is 70%
- The Recall Low Risk is 87%
Finally, the Easy Ensemble AdaBoost Classifier method indicates the following results:
- The Balanced Accuracy Score is 93.17%
- The Precision High Risk is 9%
- The Precision Low Risk is 100%
- The Recall High Risk is 92%
- The Recall Low Risk is 94%
In terms of overall performance, the methods that score the highest in the Balanced Accuracy category are:
- Easy Ensemble AdaBoost Classifier, scoring 93.17%
- Balanced Random Forest Classifier, scoring 78.85%
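For a two-class problem, balanced accuracy is the average of the two per-class recalls, so the scores above can be cross-checked against the recall figures reported for each method. The short check below uses only the numbers listed in this write-up. Small gaps are expected because the recalls are rounded to whole percentages.

```python
# (recall high risk, recall low risk, reported balanced accuracy), in percent,
# copied from the results listed above.
reported = {
    "Naive Random Oversampling": (74, 55, 64.66),
    "SMOTE Oversampling": (63, 69, 66.24),
    "ClusterCentroids Undersampling": (67, 42, 54.42),
    "SMOTEENN Sampling": (70, 58, 64.00),
    "Balanced Random Forest Classifier": (70, 87, 78.85),
    "Easy Ensemble AdaBoost Classifier": (92, 94, 93.17),
}

recomputed = {}
gaps = {}
for method, (recall_high, recall_low, reported_score) in reported.items():
    # Balanced accuracy for two classes is the mean of the two recalls.
    recomputed[method] = (recall_high + recall_low) / 2
    gaps[method] = abs(recomputed[method] - reported_score)
    print(
        f"{method}: mean recall {recomputed[method]:.1f}% "
        f"vs reported {reported_score:.2f}% (gap {gaps[method]:.2f})"
    )
```

All six gaps are below half a percentage point, which is consistent with rounding in the reported recalls.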
To detect high risk loans, a method must flag as many of the truly high risk loans as possible, which is what recall for the high risk class measures. The methods that score the highest in the recall high risk category, and are therefore the most effective at detecting high risk loans, are:
- Easy Ensemble AdaBoost Classifier with a score of 92%
- Naive Random Oversampling with a score of 74%
- SMOTEENN Sampling with a score of 70%
- Balanced Random Forest Classifier with a score of 70%
In contrast, the ClusterCentroids Undersampling model scores 67% and the SMOTE Oversampling model only 63%, so they would not be as effective as the other methods at detecting high risk loans.
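To conclude, the Easy Ensemble AdaBoost Classifier appears to be the most dependable model for detecting high risk loans. It has the highest balanced accuracy (93.17%) and the highest high risk recall (92%), meaning it catches most of the truly high risk loans. Its high risk precision is only 9%, however, so most of the loans it flags will turn out to be low risk. With that caveat, it can be recommended as the most reliable of the models tested for reducing bias and predicting credit risk.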
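To put the recommended model's high risk figures in perspective, the reported precision (9%) and recall (92%) can be turned into a false-alarm ratio and an F1 score. This is simple arithmetic on the numbers reported above, not an additional experiment.

```python
precision_high = 0.09  # reported high risk precision for Easy Ensemble AdaBoost
recall_high = 0.92     # reported high risk recall for the same model

# precision = TP / (TP + FP)  =>  FP / TP = (1 - precision) / precision
false_alarms_per_hit = (1 - precision_high) / precision_high

# F1 is the harmonic mean of precision and recall.
f1_high = 2 * precision_high * recall_high / (precision_high + recall_high)

print(f"False alarms per correctly flagged loan: {false_alarms_per_hit:.1f}")
print(f"High risk F1 score: {f1_high:.2f}")
```

Roughly ten low risk loans are flagged for every high risk loan caught. That trade-off may be acceptable when missing a high risk loan is much costlier than reviewing a flagged low risk one.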