Using loan data from Lending Club, we apply machine learning to predict the risk of loan default. We then use the model's predictions to improve the potential return on investment.
Details of the implementation can be found here.
Suppose a bank wants to know whether potential loan applicants will default on a loan. Given a client's loan information, the model predicts a binary outcome: fully paid or default. We train logistic regression, random forest, neural network, XGBoost, and ensemble classifiers to create a model. Its predictions provide useful metrics and help improve the company's return on investment.
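As a rough illustration of how several of these classifiers can be combined, here is a minimal stacking sketch using scikit-learn's `StackingClassifier`. It is not the project's actual pipeline: the data is synthetic (`make_classification`), only logistic regression and random forest are used as base models (the neural network and XGBoost models are left out to keep dependencies small), and every hyperparameter is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned loan features (NOT the Lending Club data).
# By assumption, class 1 plays the role of "fully paid" and class 0 of "default".
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Base models whose cross-validated predictions feed the meta-learner.
base_models = [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# The final estimator learns how to weight the base models' predictions.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)

# Probability of class 1 for each held-out sample, plus hold-out accuracy.
proba_fully_paid = stack.predict_proba(X_test)[:, 1]
accuracy = stack.score(X_test, y_test)
print(f"hold-out accuracy: {accuracy:.2f}")
```

The per-loan probabilities in `proba_fully_paid` are the kind of output that a later funding decision could be based on.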
Set of features:
Each row represents a client's financial information.

| | loan_amnt | term | int_rate | installment | grade | emp_length | home_ownership | annual_inc | verification_status | loan_status | purpose | dti | delinq_2yrs | earliest_cr_line | open_acc | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | total_pymnt | application_type | mort_acc | pub_rec_bankruptcies |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5000 | 36 months | 7.35% | 155.19 | A | 5 years | MORTGAGE | 60000.0 | Not Verified | Fully Paid | car | 15.76 | 0 | Oct-04 | 12 | 0 | 3697 | 13.20% | 25 | w | 5385.245133 | Individual | 1 | 0 |
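Several columns in the sample row are stored as text (`36 months`, `7.35%`, `5 years`, `Fully Paid`) and need to be made numeric before modeling. The sketch below shows one possible cleaning step. The helper names, the handling of values such as `< 1 year`, and the choice to encode fully paid as 1 are all assumptions, not the project's documented preprocessing.

```python
import re

def parse_percent(text):
    """Convert a string such as '7.35%' to a fraction (0.0735)."""
    return float(text.strip().rstrip("%")) / 100.0

def parse_term_months(text):
    """Convert a string such as '36 months' to the integer 36."""
    return int(text.split()[0])

def parse_emp_length(text):
    """Convert '5 years' to 5; treat values starting with '<' as 0 (assumption)."""
    if text.strip().startswith("<"):
        return 0
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

def clean_row(raw):
    """Return a numeric version of the text-valued fields in one raw row."""
    return {
        "loan_amnt": float(raw["loan_amnt"]),
        "term": parse_term_months(raw["term"]),
        "int_rate": parse_percent(raw["int_rate"]),
        "emp_length": parse_emp_length(raw["emp_length"]),
        "revol_util": parse_percent(raw["revol_util"]),
        # Assumed target encoding: 1 = fully paid, 0 = anything else.
        "fully_paid": 1 if raw["loan_status"] == "Fully Paid" else 0,
    }

# A subset of the fields from the sample row in the table above.
raw = {
    "loan_amnt": "5000",
    "term": "36 months",
    "int_rate": "7.35%",
    "emp_length": "5 years",
    "revol_util": "13.20%",
    "loan_status": "Fully Paid",
}
clean = clean_row(raw)
print(clean)
```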
Many models were trained, but the final model chosen is an ensemble built by stacking:
Model Accuracy: 68%
Overall return without model: -20.62%
Overall return with model: -7.90%
Overall percent improvement: 84.04%
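How these return figures were computed is not stated here. One plausible reading, shown below as a sketch, is to compare total payments received (`total_pymnt`) against total principal lent (`loan_amnt`), first over all loans and then over only the loans the model expects to be fully paid. The return definition, the 0.5 threshold, the `p_fully_paid` field, and all toy loans except the first (which is taken from the sample row, rounded) are assumptions for illustration.

```python
def portfolio_return(loans):
    """Return (received - invested) / invested over a list of loans."""
    invested = sum(loan["loan_amnt"] for loan in loans)
    received = sum(loan["total_pymnt"] for loan in loans)
    return (received - invested) / invested

# Hypothetical loans; p_fully_paid stands in for the model's predicted probability.
loans = [
    {"loan_amnt": 5000.0, "total_pymnt": 5385.25, "p_fully_paid": 0.90},
    {"loan_amnt": 10000.0, "total_pymnt": 3200.00, "p_fully_paid": 0.35},
    {"loan_amnt": 8000.0, "total_pymnt": 8900.00, "p_fully_paid": 0.80},
    {"loan_amnt": 12000.0, "total_pymnt": 6100.00, "p_fully_paid": 0.55},
]

# Assumed decision rule: fund a loan only if the model leans towards "fully paid".
threshold = 0.5
funded = [loan for loan in loans if loan["p_fully_paid"] >= threshold]

baseline = portfolio_return(loans)     # return when every loan is funded
with_model = portfolio_return(funded)  # return when the model screens loans

print(f"return without model: {baseline:.2%}")
print(f"return with model:    {with_model:.2%}")
```

In this toy portfolio the model screens out one loan that repaid far less than its principal, so the screened return is higher than the baseline even though both remain negative.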
- Left: If the model predicts fully paid with 75% probability, the loan falls in the 70% - 80% range
- Right: If the model predicts a 75% probability, the average improvement is 18%
- As the predicted probability increases, the improvement in return increases until the 80% - 100% range. In this range, few to no loans default, so there is little opportunity to improve returns
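The grouping into probability ranges described above can be sketched as follows. The function name, the 10-point bucket width, and the toy predictions are assumptions. The example only counts loans and defaults per range and does not reproduce the improvement figures.

```python
from collections import defaultdict

def probability_bucket(p, width=10):
    """Map a probability in [0, 1] to a range label such as '70% - 80%'."""
    pct = round(p * 100, 6)  # rounding avoids floating-point edge cases like 28.999...
    lower = min(int(pct // width) * width, 100 - width)  # p == 1.0 goes in the top range
    return f"{lower}% - {lower + width}%"

# Hypothetical (predicted probability of fully paid, actually fully paid?) pairs.
predictions = [
    (0.75, True),
    (0.72, False),
    (0.95, True),
    (0.88, True),
    (0.41, False),
]

# Count loans and defaults in each probability range.
summary = defaultdict(lambda: {"loans": 0, "defaults": 0})
for p, fully_paid in predictions:
    bucket = probability_bucket(p)
    summary[bucket]["loans"] += 1
    if not fully_paid:
        summary[bucket]["defaults"] += 1

for bucket in sorted(summary):
    print(bucket, summary[bucket])
```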