/Loan-Defaults

Investigating Factors that Lead to Loan Defaults.

Primary Language: Jupyter Notebook

Loan Defaults Modelling


In this project, the aim was to develop a model that would help to identify borrowers who were the most likely to default on their loans.

Preprocessing and Exploratory Data Analysis results can be viewed in the eda.ipynb notebook, and the training and validation of different models can be found in the ml_models.ipynb notebook.

These notebooks document my thought process as I worked through the project.

Note: The functions used in these notebooks can be found in the scripts within the utils folder.

Machine Learning Models

  • NearestCentroid
  • K-Nearest Neighbours
  • Logistic Regression
  • Random Forest Classifier
  • Gradient Boosting Classifier

Resampling Techniques

Our exploratory data analysis shows that the dataset is imbalanced. We therefore apply different resampling techniques and monitor whether they improve each model's performance.

  1. Random Oversampling
  2. Synthetic Minority Oversampling Technique (SMOTE)
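To make the difference between the two techniques concrete, here is a minimal pure-Python sketch. It is not the code from the notebooks: the function names, the toy data, and the choice of `k` are all hypothetical, and the SMOTE variant is a simplified illustration of the idea (interpolating between a minority row and one of its nearest minority neighbours). In practice a library such as imbalanced-learn would likely be used instead.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))  # exact copy of an existing row
            y_out.append(label)
    return X_out, y_out


def smote_oversample(X, y, minority_label, k=2, seed=0):
    """Simplified SMOTE: add synthetic rows between minority rows and their near neighbours."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    n_new = max(counts.values()) - counts[minority_label]
    X_out, y_out = list(X), list(y)
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of `base` (Euclidean distance), excluding itself.
        others = sorted(
            (m for m in minority if m is not base),
            key=lambda m: math.dist(base, m),
        )
        neighbour = rng.choice(others[:k])
        t = rng.random()  # position along the segment base -> neighbour
        X_out.append([b + t * (n - b) for b, n in zip(base, neighbour)])
        y_out.append(minority_label)
    return X_out, y_out


# Hypothetical toy data: 6 non-defaults (label 0) and 3 defaults (label 1).
X = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.2], [0.1, 0.9], [0.8, 0.3], [0.3, 0.6],
     [5.0, 5.0], [6.0, 5.5], [5.5, 6.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

X_ros, y_ros = random_oversample(X, y)
X_sm, y_sm = smote_oversample(X, y, minority_label=1)
print(Counter(y_ros), Counter(y_sm))
```

Random oversampling only repeats existing minority rows, while the SMOTE-style function creates new rows that lie between existing ones. Either way, resampling is normally applied to the training data only, so that the holdout data keeps its original class balance.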

Data Preparation and Evaluation Metric

The dataset was split into training (80%) and holdout (20%) datasets.

Each model was trained using 5-fold cross-validation on the training data before being evaluated on the holdout data.
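The split and cross-validation workflow could look roughly like the scikit-learn sketch below. This is an illustration under assumptions rather than the notebook code: the data is synthetic (generated with `make_classification`), the split and folds are stratified (the original does not say whether stratification was used), and Logistic Regression stands in for any of the listed models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for the loan data (about 10% positives).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# 80% training / 20% holdout; stratify=y is an assumption that keeps the
# class ratio the same in both parts.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation on the training data only, scored with F1.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1")
print("CV F1 per fold:", cv_scores)

# Refit on the full training set, then evaluate once on the holdout data.
model.fit(X_train, y_train)
holdout_f1 = f1_score(y_holdout, model.predict(X_holdout))
print("Holdout F1:", holdout_f1)
```

Keeping the holdout data out of cross-validation means the final score comes from data the model never saw during training or model selection.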

The primary metric used to evaluate model performance is the F1 score, which is the harmonic mean of precision and recall. The rationale for choosing F1 is that we want to identify as many defaults as possible (high recall) while ensuring that the model does not over-predict defaults (high precision).

Additionally, we provide insights into each model's precision and recall, and into how different use cases may favour different models.
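As a small worked example of why F1 balances the two concerns, the sketch below computes precision, recall, and F1 = 2 * precision * recall / (precision + recall) from scratch. The helper name and the labels are hypothetical, with 1 standing for a default.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical labels: 2 defaults among 10 borrowers.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A model that flags everyone catches every default but over-predicts.
flag_all = precision_recall_f1(y_true, [1] * 10)

# A cautious model flags one borrower, correctly, and misses the other default.
cautious = precision_recall_f1(y_true, [1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

print(flag_all)   # precision 0.2, recall 1.0, F1 = 1/3
print(cautious)   # precision 1.0, recall 0.5, F1 = 2/3
```

Because the harmonic mean is pulled towards the smaller of the two values, the flag-everyone model scores poorly despite its perfect recall.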