In this project I use machine learning to build prediction models for credit risk classification. Credit risk is an inherently unbalanced classification problem, because good loans far outnumber risky loans. I use several resampling and ensemble techniques to train and evaluate models on unbalanced classes.
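To illustrate why unbalanced classes need special care, here is a minimal sketch of how plain accuracy can mislead. The class counts are hypothetical and chosen only for illustration; they are not taken from this project's dataset.

```python
# Hypothetical class counts, for illustration only (not from this project's data).
n_good = 9900   # low-risk loans (majority class)
n_risky = 100   # high-risk loans (minority class)

# A trivial "model" that labels every loan as good never detects a risky loan.
correct = n_good  # every good loan is labeled correctly, every risky loan is missed
accuracy = correct / (n_good + n_risky)

# Per-class recall: correctly labeled members of a class / all members of that class.
good_recall = n_good / n_good
risky_recall = 0 / n_risky

# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = (good_recall + risky_recall) / 2

print(f"accuracy          = {accuracy:.2f}")
print(f"risky recall      = {risky_recall:.2f}")
print(f"balanced accuracy = {balanced_accuracy:.2f}")
```

Under these assumed counts the trivial model looks excellent on accuracy while finding no risky loans at all, which is why the models in this project are also judged on precision and recall.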
- BACKGROUND INFO:
- Precision Score = True Positive / (True Positive + False Positive)
  - Of the loans the model predicted as positive, the share that actually were positive.
- Recall Score = True Positive / (True Positive + False Negative)
  - Of the loans that actually were positive, the share the model correctly identified. For example, an applicant who knows their loan is good wants to know how likely the model is to classify it as good.
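The precision and recall definitions above can be sketched in a few lines of plain Python. Here a true positive is a loan that is both predicted positive and actually positive. The helper names and the sample labels below are hypothetical and exist only for illustration; they are not this project's code or data.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true positives, false positives and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, fn


def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that the model predicted as positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical labels, for illustration only.
y_true = ["good", "good", "good", "risky", "risky", "good"]
y_pred = ["good", "good", "risky", "risky", "good", "good"]

tp, fp, fn = confusion_counts(y_true, y_pred, positive="good")
print(f"TP={tp} FP={fp} FN={fn}")
print(f"precision={precision(tp, fp):.2f} recall={recall(tp, fn):.2f}")
```

In practice a library routine such as scikit-learn's `precision_score` and `recall_score` would normally be used; the hand-written version only makes the arithmetic visible.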
BalancedRandomForestClassifier
- Accuracy score = 0.79
- Precision = 0.99
- Recall = 0.85
Ensemble AdaBoost Classifier
- Accuracy score = 0.91
- Precision = 0.99
- Recall = 0.93
Naive Random Oversampling w/ Logistic Regression
- Accuracy score = 0.68
- Precision = 0.99
- Recall = 0.68
SMOTE Oversampling w/ Logistic Regression
- Accuracy score = 0.66
- Precision = 0.99
- Recall = 0.69
ClusterCentroids Undersampling w/ Logistic Regression
- Accuracy score = 0.60
- Precision = 0.99
- Recall = 0.53
Combination (Over and Under) Sampling w/ Logistic Regression
- Accuracy score = 0.66
- Precision = 0.99
- Recall = 0.64
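Because every model reports the same precision, it can help to combine precision and recall into a single number when comparing them. The sketch below copies the scores listed above and ranks the models by F1, the harmonic mean of precision and recall. F1 is not reported in this project, so it is an added illustration, and since the reported precision and recall may be averages across classes, the resulting values should be read as approximate.

```python
# Scores copied from the results above: (accuracy, precision, recall).
results = {
    "BalancedRandomForestClassifier": (0.79, 0.99, 0.85),
    "Ensemble AdaBoost Classifier": (0.91, 0.99, 0.93),
    "Naive Random Oversampling": (0.68, 0.99, 0.68),
    "SMOTE Oversampling": (0.66, 0.99, 0.69),
    "ClusterCentroids Undersampling": (0.60, 0.99, 0.53),
    "Combination (Over and Under) Sampling": (0.66, 0.99, 0.64),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Sort models from highest to lowest F1.
ranked = sorted(
    results.items(),
    key=lambda item: f1(item[1][1], item[1][2]),
    reverse=True,
)

for name, (acc, prec, rec) in ranked:
    print(f"{name:40s} accuracy={acc:.2f} F1={f1(prec, rec):.2f}")
```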
Overall, the two ensemble models, the Ensemble AdaBoost Classifier and the Balanced Random Forest Classifier, predict credit risk better than the four resampling approaches paired with logistic regression. Every model reaches a precision of 0.99, but the ensembles have clearly higher accuracy and recall. I recommend the Ensemble AdaBoost Classifier, which has the highest accuracy (0.91) and recall (0.93) of all six models.