Supervised Machine Learning and Credit Risk

About The Project

In 2019, more than 19 million Americans had at least one unsecured personal loan. That’s a record-breaking number! Personal lending is growing faster than credit card, auto, mortgage, and even student debt. With such incredible growth, FinTech firms are storming ahead of traditional loan processes. By using the latest machine learning techniques, these FinTech firms can continuously analyze large amounts of data and predict trends to optimize lending.

We used Python to build and evaluate several machine learning models to predict credit risk. Being able to predict credit risk with machine learning algorithms can help banks and financial institutions predict anomalies, reduce risk cases, monitor portfolios, and provide recommendations on what to do in cases of fraud.

Roadmap

  • Logistic Regression
  • Classification Model Validation
  • Support Vector Machines
  • Data Preprocessing in Machine Learning
  • Decision Trees
  • Ensemble Learning and Random Forests
  • Bagging and Boosting

Steps

The goals of this project are to:

  • Implement machine learning models.
  • Use resampling to attempt to address class imbalance.
  • Evaluate the performance of machine learning models using the scikit-learn library.

Tasks

  1. Oversample the data using the RandomOverSampler and SMOTE algorithms.

  2. Undersample the data using the cluster centroids algorithm.

  3. Use a combination approach with the SMOTEENN algorithm.

For each of the above:

    1. Train a logistic regression classifier (from Scikit-learn) using the resampled data.
    2. Calculate the balanced accuracy score using balanced_accuracy_score from sklearn.metrics.
    3. Generate a confusion_matrix.
    4. Print the classification report (classification_report_imbalanced from imblearn.metrics).

Analysis

Oversampling

Naive Random Oversampling

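The balanced accuracy score for naive random oversampling is 0.65. The classification report follows. Precision is very low for high_risk (0.01) and high for low_risk (1.00): because the test set holds only 101 high_risk samples against 17,104 low_risk ones, almost every loan flagged as high risk is actually low risk. Recall (sensitivity) is not ideal for either class (0.70 for high_risk, 0.60 for low_risk).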

                  pre       rec       spe        f1       geo       iba       sup

  high_risk      0.01      0.70      0.60      0.02      0.65      0.42       101
   low_risk      1.00      0.60      0.70      0.75      0.65      0.42     17104

avg / total      0.99      0.60      0.70      0.74      0.65      0.42     17205
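
In these reports, pre is precision, rec is recall, spe is specificity, f1 is the F1 score, geo is the geometric mean of recall and specificity, iba is the index balanced accuracy, and sup is the support (sample count). As a rough check, the short sketch below recomputes three numbers from the rounded values in the report above; the variable names are illustrative, and the results are approximate because the inputs are already rounded.

```python
import math

# Rounded values read from the naive random oversampling report.
recall_high, recall_low = 0.70, 0.60
support_high, support_low = 101, 17104

# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = (recall_high + recall_low) / 2

# For a binary problem, the specificity of high_risk equals the recall of low_risk,
# so the geometric mean is sqrt(recall_high * recall_low).
geometric_mean = math.sqrt(recall_high * recall_low)

# Precision for high_risk: correctly flagged loans over all flagged loans.
true_positives = recall_high * support_high        # high_risk loans flagged
false_positives = (1 - recall_low) * support_low   # low_risk loans flagged
precision_high = true_positives / (true_positives + false_positives)

print(round(balanced_accuracy, 2), round(geometric_mean, 2), round(precision_high, 2))
```

The roughly 6,800 misclassified low_risk loans swamp the roughly 70 correctly flagged high_risk loans, which is why high_risk precision sits near 0.01 despite a recall of 0.70.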

SMOTE Oversampling

The balanced accuracy score for SMOTE oversampling is 0.63. The classification report follows. As with random oversampling, precision is very low for high_risk (0.01) and high for low_risk (1.00). Recall (sensitivity) is not ideal for either class: high_risk recall drops to 0.59 from 0.70 under random oversampling, while low_risk recall rises slightly to 0.66 from 0.60.

                  pre       rec       spe        f1       geo       iba       sup

  high_risk      0.01      0.59      0.66      0.02      0.62      0.39       101
   low_risk      1.00      0.66      0.59      0.79      0.62      0.39     17104

avg / total      0.99      0.66      0.59      0.79      0.62      0.39     17205

In summary, neither oversampling approach performs well.
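
For intuition about what the two oversamplers do, here is a simplified pure-Python sketch. It is not imbalanced-learn's implementation: `random_oversample` and `smote_like_sample` are hypothetical helpers, the toy rows are made up, and real SMOTE picks the neighbour from the k nearest minority samples rather than taking one as an argument.

```python
import random


def random_oversample(minority, majority, seed=0):
    """Duplicate minority rows (with replacement) until both classes are the same size.

    Assumes the majority class is at least as large as the minority class.
    """
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return minority + extra


def smote_like_sample(point, neighbour, rng):
    """Create one synthetic row on the line segment between two minority rows."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(point, neighbour))


# Toy data (assumption): two minority rows, six majority rows.
minority = [(1.0, 2.0), (2.0, 3.0)]
majority = [(float(i), 0.0) for i in range(6)]

balanced_minority = random_oversample(minority, majority)
synthetic = smote_like_sample(minority[0], minority[1], random.Random(0))

print(len(balanced_minority), len(majority))
print(synthetic)
```

Random oversampling only repeats existing rows, while the SMOTE-style step creates new rows between existing ones; neither adds information about the minority class that was not already in the training data.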

Undersampling

The balanced accuracy score for cluster centroids undersampling is 0.63. The classification report follows. Precision is again very low for high_risk (0.01) and high for low_risk (1.00), reflecting how few high_risk samples the test set contains. Recall (sensitivity) is poor for both classes (0.65 for high_risk, 0.54 for low_risk).

                  pre       rec       spe        f1       geo       iba       sup

  high_risk      0.01      0.65      0.54      0.02      0.60      0.36       101
   low_risk      1.00      0.54      0.65      0.70      0.60      0.35     17104

avg / total      0.99      0.54      0.65      0.70      0.60      0.35     17205
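
The idea behind cluster centroids undersampling can be sketched in plain Python: cluster the majority class into as many groups as there are minority samples, then keep only the cluster centres. The code below is an illustrative approximation using a basic k-means loop, not imbalanced-learn's ClusterCentroids; the function names, iteration count, and toy data are assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_centroids(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm: returns k centroids for the given points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


def undersample_with_centroids(majority, minority, seed=0):
    """Replace the majority class with one centroid per minority sample."""
    return kmeans_centroids(majority, k=len(minority), seed=seed)


# Toy data (assumption): nine majority rows, three minority rows.
majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
            (5.0, 5.0), (5.1, 5.2), (5.2, 5.1),
            (9.0, 0.0), (9.1, 0.2), (9.2, 0.1)]
minority = [(2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]

reduced_majority = undersample_with_centroids(majority, minority)
print(len(reduced_majority), len(minority))
```

Because the retained majority rows are synthetic centroids rather than real loans, a large amount of majority-class detail is discarded, which may contribute to the weak scores above.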

Combination (Over and Under) Sampling

The balanced accuracy score for SMOTEENN combination sampling is 0.6. The classification report follows. Precision is again very low for high_risk (0.01) and high for low_risk (1.00), reflecting how few high_risk samples the test set contains. Recall (sensitivity) is poor for both classes (0.70 for high_risk, 0.60 for low_risk).

                  pre       rec       spe        f1       geo       iba       sup

  high_risk      0.01      0.70      0.60      0.02      0.65      0.43       101
   low_risk      1.00      0.60      0.70      0.75      0.65      0.42     17104

avg / total      0.99      0.60      0.70      0.75      0.65      0.42     17205
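
SMOTEENN first oversamples with SMOTE and then cleans the result with edited nearest neighbours (ENN). The cleaning step can be sketched as below. This is a simplified illustration, not the library's code: the helper name and toy data are hypothetical, and it keeps a row when a majority of its k neighbours share its label, whereas imbalanced-learn's default is stricter and requires all neighbours to agree.

```python
def edited_nearest_neighbours(rows, labels, k=3):
    """Keep only rows whose label matches most of their k nearest neighbours."""
    kept = []
    for i, (row, label) in enumerate(zip(rows, labels)):
        # Rank every other row by squared Euclidean distance to this row.
        others = [j for j in range(len(rows)) if j != i]
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(row, rows[j])))
        neighbour_labels = [labels[j] for j in others[:k]]
        if neighbour_labels.count(label) > k / 2:
            kept.append((row, label))
    return kept


# Toy 1-D data (assumption): one low_risk row sits inside the high_risk cluster.
rows = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,), (10.5,)]
labels = ["low_risk"] * 4 + ["high_risk"] * 3 + ["low_risk"]

cleaned = edited_nearest_neighbours(rows, labels)
print(len(cleaned))
```

The stray low_risk row at 10.5 is dropped because all three of its nearest neighbours are high_risk, which is the kind of boundary noise the ENN step is meant to remove.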

Recommendations

Based on the analysis above, none of these models is recommended, because all of them have accuracy scores below 0.7. Precision for the high_risk class is very low (0.01) in every case, and recall (sensitivity) is also poor. A more capable model that better distinguishes the features needs to be established for better predictions.

Extension

BalancedRandomForestClassifier raised the accuracy score slightly, to 0.74, but the precision and sensitivity problems remain. EasyEnsembleClassifier is by far the best model, with an accuracy score of 0.94. Sensitivity improves as well (recall of 0.92 for high_risk and 0.95 for low_risk), although high_risk precision is still only 0.10. The classification_report_imbalanced output follows.

                  pre       rec       spe        f1       geo       iba       sup

  high_risk      0.10      0.92      0.95      0.18      0.94      0.87       101
   low_risk      1.00      0.95      0.92      0.97      0.94      0.88     17104

avg / total      0.99      0.95      0.92      0.97      0.94      0.88     17205

License

Distributed under the MIT License. See LICENSE for more information.