
Capstone Project - Machine Learning Engineer Nanodegree - Udacity

Home Credit Default Risk Recognition

By Abhishek Bihani

Based on the Kaggle Competition


Domain Background

An important fraction of the population finds it difficult to get home loans approved due to an insufficient or absent credit history. This prevents them from buying their dream homes and at times forces them to rely on other sources of money, which may be unreliable or charge exorbitant interest rates. Conversely, it is a major challenge for banks and other lending agencies to decide which candidates to approve for housing loans. Credit history alone is not always a sufficient basis for the decision: borrowers with a long credit history can still default on a loan, while some applicants with a good chance of repaying may simply not have a long enough credit history.

A number of recent studies have applied machine learning to predicting loan default risk. This matters because a machine learning-based classification tool that uses more features than the traditional credit history alone can be of great help to both potential borrowers and lending institutions.


Problem Statement

The problem and associated data were provided by the Home Credit Group for a Kaggle competition. The problem can be stated as a binary classification task: given various features describing the financial and behavioral history of a loan applicant, predict whether the loan will be repaid or defaulted.


Project Novelty

The notebook provides a complete end-to-end machine learning workflow for building a binary classifier. It includes automated feature engineering to connect the relational tables, a comparison of different classifiers on imbalanced data, and hyperparameter tuning using Bayesian optimization.


Figure 1- ROC Curve and AUC Comparison of Different Classifiers

Datasets and Inputs

The dataset is provided on the Kaggle website as multiple CSV files that are free to download. The files and the links between them are described in Figure 2.


Figure 2- Description and connectivity of the Home Credit Default Risk dataset

As seen in Figure 2, the files application_{train|test}.csv form the main table, split into a training set (307511 samples) and a test set (48744 samples), with each row representing one loan identified by the feature SK_ID_CURR. The training set contains the response variable TARGET with binary values (0: the loan was repaid, 1: the loan was not repaid). Several additional input files are available and can be analysed for input features to train the model. The large number of candidate features and training samples makes it easier to identify the important factors and to construct a credit default risk classification model.
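
A minimal sketch of loading the main table and checking the class balance is shown below; the data/ path is an assumption and should point at wherever the Kaggle CSV files were downloaded.

```python
import pandas as pd

# Paths are assumptions; adjust them to the folder holding the Kaggle CSVs.
app_train = pd.read_csv('data/application_train.csv')
app_test = pd.read_csv('data/application_test.csv')

print('Training data shape:', app_train.shape)  # 307511 rows
print('Testing data shape:', app_test.shape)    # 48744 rows, no TARGET column

# Class balance of the response variable (0: repaid, 1: not repaid)
print(app_train['TARGET'].value_counts(normalize=True))
```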


Project Design and Solution

The project is divided into five parts:

  1. Data Preparation - Before starting the modeling, we need to import the necessary libraries and the datasets. Since there is more than one data file, all of the files need to be imported before we can look at the feature types and the number of rows/columns in each.

  2. Exploratory Data Analysis - After importing the data, we can investigate it and answer questions such as: How many features are present and how are they interlinked? What is the data quality, and are there missing values? What data types are present, and are there many categorical features? Is the data imbalanced? And most importantly, are there any obvious patterns between the predictor features and the response?

  3. Feature Engineering - After exploring the data distributions, we can conduct feature engineering to prepare the data for model training. This includes operations like replacing outliers, imputing missing values, one-hot encoding categorical variables, and rescaling the data. Since the data is spread across a number of related tables, we can use automated feature engineering with Featuretools to extract, transform, and load (ETL) the datasets into a single feature matrix (see the Featuretools sketch after this list). The additional features from these tables help improve the results over the base case (logistic regression).

  4. Classifier Models: Training, Prediction and Comparison - After the dataset is split into training and testing sets, we can correct the class imbalance by undersampling the majority class. Then we can train the different classifier models (Logistic Regression, Random Forest, Decision Tree, Gaussian Naive Bayes, XGBoost, Gradient Boosting, LightGBM) and compare their performance on the test data using metrics like accuracy, F1-score and ROC AUC (a minimal comparison sketch follows this list). After choosing the best classifier, we can use K-fold cross-validation to select the best model, which lets us pick the parameters that give the best performance without creating a separate validation dataset.

  5. Hyperparameter Tuning - After choosing the binary classifier, we can tune its hyperparameters to improve the model results using grid search, random search, and Bayesian optimization (Hyperopt library); a Hyperopt sketch is also given after this list. Each tuning method evaluates an objective function over a given domain space and uses an optimization algorithm to propose the next set of hyperparameters. The ROC AUC validation scores from all three methods can be compared across iterations to see trends.
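
A minimal sketch of the automated feature engineering step is given below. It assumes the Kaggle CSVs sit in a local data/ folder, uses only two of the tables (application_train.csv and its child table bureau.csv), and follows the Featuretools >= 1.0 API; the full notebook connects more tables and may use an older API.

```python
import featuretools as ft
import pandas as pd

# Main application table and one child table (bureau.csv holds previous
# credits reported by other institutions, keyed by SK_ID_CURR).
app = pd.read_csv('data/application_train.csv')
bureau = pd.read_csv('data/bureau.csv')
bureau = bureau[bureau['SK_ID_CURR'].isin(app['SK_ID_CURR'])]

# Describe the tables and the parent-child relationship in an EntitySet.
es = ft.EntitySet(id='clients')
es = es.add_dataframe(dataframe_name='app', dataframe=app, index='SK_ID_CURR')
es = es.add_dataframe(dataframe_name='bureau', dataframe=bureau, index='SK_ID_BUREAU')
es = es.add_relationship('app', 'SK_ID_CURR', 'bureau', 'SK_ID_CURR')

# Deep Feature Synthesis: aggregate child-table columns (mean, max, sum, ...)
# up to one row per loan application.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='app',
    agg_primitives=['mean', 'max', 'min', 'sum', 'count'],
    max_depth=2,
)
print(feature_matrix.shape)
```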
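
The undersampling and model comparison loop can be sketched as below. To keep the example self-contained it uses a synthetic imbalanced dataset from scikit-learn (roughly 8% positives, similar to the share of defaulted loans) as a stand-in for the engineered Home Credit features, and only two of the candidate classifiers.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score

# Toy stand-in for the engineered feature matrix and TARGET column.
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.92, 0.08], random_state=42)
X, y = pd.DataFrame(X), pd.Series(y, name='TARGET')

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Undersample the majority class (repaid loans) so the training set is balanced.
train = pd.concat([X_train, y_train], axis=1)
minority = train[train['TARGET'] == 1]
majority = train[train['TARGET'] == 0].sample(n=len(minority), random_state=42)
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=42)
X_bal, y_bal = balanced.drop(columns='TARGET'), balanced['TARGET']

# Train the candidate classifiers and compare test-set metrics.
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, model in models.items():
    model.fit(X_bal, y_bal)
    proba = model.predict_proba(X_test)[:, 1]
    preds = (proba >= 0.5).astype(int)
    print(f'{name}: ROC AUC={roc_auc_score(y_test, proba):.3f}  '
          f'F1={f1_score(y_test, preds):.3f}  '
          f'accuracy={accuracy_score(y_test, preds):.3f}')
```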
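
For the Bayesian optimization part of the tuning step, a minimal Hyperopt sketch is shown below. The search space, the LightGBM parameters, and the toy data are illustrative assumptions; the notebook also runs grid search and random search for comparison and tunes more parameters over many more evaluations.

```python
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Toy data standing in for the engineered feature matrix.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.92, 0.08], random_state=42)

# Domain space: the hyperparameter ranges the optimizer may explore.
space = {
    'num_leaves': hp.quniform('num_leaves', 20, 150, 1),
    'learning_rate': hp.loguniform('learning_rate', -5, 0),
    'min_child_samples': hp.quniform('min_child_samples', 20, 200, 5),
}

# Objective: cross-validated ROC AUC; Hyperopt minimizes, so return 1 - AUC.
def objective(params):
    model = LGBMClassifier(
        num_leaves=int(params['num_leaves']),
        learning_rate=params['learning_rate'],
        min_child_samples=int(params['min_child_samples']),
        n_estimators=100,
        random_state=42,
    )
    auc = cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()
    return {'loss': 1 - auc, 'status': STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=25, trials=trials)
print('Best hyperparameters found:', best)
```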


Package/Library Requirements

The following packages need to be installed for running the project notebook.

  1. sklearn - For models and metrics
  2. warnings - For suppressing warnings
  3. numpy - For basic matrix handling
  4. matplotlib - For figure plotting
  5. pandas - For creating dataframes
  6. seaborn - For figure plotting
  7. timeit - For tracking execution times
  8. os - For setting the working directory
  9. random - For creating random seeds
  10. csv - For saving CSV files
  11. json - For creating JSON files
  12. itertools - For creating iterators for efficient looping
  13. pprint - For pretty-printing data structures
  14. pydash - For general-purpose utility functions in a functional style
  15. gc - For garbage collection to free memory
  16. re - For regular expression pattern matching
  17. featuretools - For automated feature engineering
  18. xgboost - For the XGBoost model
  19. lightgbm - For the LightGBM model
  20. hyperopt - For Bayesian hyperparameter optimization

Note - The packages can be installed by uncommenting the first cell in the project notebook.

References / Acknowledgements

This project builds on scripts and explanations from other Jupyter notebooks publicly shared on Kaggle. The list is as follows:

  1. A Gentle Introduction
  2. Introduction to Automated Feature Engineering
  3. Advanced Automated Feature Engineering
  4. Intro to Model Tuning: Grid and Random Search
  5. Automated Model Tuning
  6. Home Credit Default Risk Extensive EDA
  7. Home Credit Default Risk: Visualization & Analysis
  8. Loan repayers v/s Loan defaulters - HOME CREDIT