This project explores a dataset of 495,242 loans with 144 recorded variables and trains a machine learning model to predict whether a borrower will default on their loan.
The final model was a random forest classifier, which obtained 82% accuracy, an AUC of 0.90 and an F1 score of 0.85.
The data comes from four CSV files, available here: Loans2018Q1, Loans2018Q2, Loans2018Q3, Loans2018Q4.
- dimensions
- variables
- missing data
- relationships between loan size, loan grade, purpose of loan, property ownership and application type
- removing empty columns
- removing columns with large percentage of missing data
- modifying data types
- label encoding of strings to numerical
- creating dummy variables for low cardinality variables
- creating bins for loan amount and annual income
- imputing missing data with the median
- Using TPOT to identify best model and hyperparameters
- Using Parfit to perform a grid search
- Training a random forest with hyperparameters identified above
- Assessing accuracy, AUC, F1 score and plotting confusion matrix
- Calculating Gini importances
- Plotting precision-recall curve
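
The cleaning and modelling steps listed above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: it runs on a small synthetic table, every column name (`loan_amnt`, `annual_inc`, `int_rate`, `home_ownership`, `empty_col`, `default`) is a hypothetical stand-in, the 50% missing-data threshold is an assumption, and the random forest uses default-style settings instead of the hyperparameters found with TPOT and Parfit.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "loan_amnt": rng.uniform(1_000, 40_000, n),
    "annual_inc": rng.lognormal(11, 0.5, n),
    "int_rate": rng.uniform(5, 30, n),
    "home_ownership": rng.choice(["RENT", "OWN", "MORTGAGE"], n),
    "empty_col": np.nan,
})
df.loc[rng.random(n) < 0.1, "annual_inc"] = np.nan  # inject some missing values
# Toy target: a higher interest rate makes default more likely.
df["default"] = (rng.random(n) < df["int_rate"] / 40).astype(int)

# Remove empty columns, then columns with a large share of missing data
# (the 50% cut-off is an assumed value).
df = df.dropna(axis=1, how="all")
df = df.loc[:, df.isna().mean() < 0.5]

# Impute the remaining numeric gaps with each column's median.
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Bin loan amount into quartiles and create dummy variables
# for the low-cardinality string column.
df["loan_amnt_bin"] = pd.qcut(df["loan_amnt"], q=4, labels=False)
df = pd.get_dummies(df, columns=["home_ownership"])

X = df.drop(columns="default")
y = df["default"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Placeholder hyperparameters, not the tuned ones from the project.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

accuracy = accuracy_score(y_test, pred)
auc = roc_auc_score(y_test, proba)
f1 = f1_score(y_test, pred)
cm = confusion_matrix(y_test, pred)
print(f"accuracy={accuracy:.2f} auc={auc:.2f} f1={f1:.2f}")
print(cm)
```

For brevity the sketch imputes before splitting; computing the medians on the training split only would avoid leaking test-set information into the features.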
Calculating the Gini importances for the features showed a good spread of importance across the top 30 variables. The most strongly predictive variables are those related to recoveries, late fees, total loan amount, interest rate and grade, all of which make intuitive sense.
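
As a rough illustration of how Gini importances can be read off a fitted random forest, the sketch below uses scikit-learn's `feature_importances_` attribute on synthetic data. The data and the `feature_i` names are hypothetical placeholders, not the loan variables, so the ranking it prints says nothing about the real results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data standing in for the loan features.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# The default split criterion is Gini impurity.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ is the impurity-based importance of each feature,
# averaged over the trees and normalised to sum to 1.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)

top = importances.head(5)
print(top)
```

Sorting the series makes it straightforward to plot or list the top variables, for example the top 30 as described above.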