/MScCapstone

UCD Data and Computational Science - Final Project


Comparison and Evaluation of Different Machine Learning Methods at Predicting Credit Card Default

📘 Overview

The aim of this project is to apply a range of machine learning techniques to predict credit card default using the historical data of credit card customers. The following report describes the process undertaken to compare a number of models: beginning with an industrial-size dataset, the data is cleansed and formatted before being used to train, tune and evaluate the chosen models (Neural Network, Random Forest and Logistic Regression).

📂 Files

Code from Repository

Datasets from Google Drive

❗ Requirements

Each of the software packages below must be installed as a prerequisite to viewing the models.

R / RStudio

RStudio must be installed in order to review the codebase.
Installation guide: https://rstudio-education.github.io/hopr/starting.html

TensorFlow

TensorFlow and Keras must be installed to run the Neural Network model.
Installation guide: https://keras.rstudio.com/install/index.html

🔧 Installation

Once all files have been downloaded, root/R/main.R must be configured in order to specify path variables.

  • PATH_WD - Root of working directory from GitHub clone
  • PATH_DB - Root of database directory downloaded from Google Drive

NB ❗

  • Paths must end in a “/” separator (e.g. /Users/root1/Documents/CreditCardDefault/)
  • Windows directories must use either “\\” or “/” separators
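
The two rules above can be checked before running main.R. The following is a minimal sketch only: `normalise_path` is a hypothetical helper that is not part of the repository, and the example locations are placeholders (the database directory name is taken from the layout shown under Structure).

```r
# Hypothetical helper (not in the repository): convert Windows backslashes
# to "/" and guarantee exactly one trailing "/".
normalise_path <- function(path) {
  path <- gsub("\\", "/", path, fixed = TRUE)  # "\\" is a single backslash in R
  path <- sub("/+$", "", path)                 # drop any trailing separators
  paste0(path, "/")                            # add exactly one back
}

# Placeholder locations; replace with your own clone and download directories.
PATH_WD <- normalise_path("/Users/root1/Documents/CreditCardDefault")
PATH_DB <- normalise_path("/Users/root1/Documents/CreditCardDefault_Database/")
```

With values like these in place, both variables end in a single “/” regardless of how the path was originally typed.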

Once all fields have been filled, run main.R to load all required files into memory.
From here all functions are available to call. See the code comments in each script for more details.

📁 Structure

GitHub

CreditCardDefault
│   README.md
│
└─── cache
│
└─── paper
│
└─── plots
│
└─── R
│   │   amex_metric.R
│   │   cleansing.R
│   │   database.R
│   │   DNN.R
│   │   EDAPlots.R
│   │   logisticP2.R
│   │   main.R
│   │   NeuralNetwork.R
│   │   Noise.R
│   │   readme
│   │   rf_logreg.R

Google Drive

Data store for original CSV files and processed parquet files.
Results are also available here, including tuned models stored in RDS files.

NB ❗ Google Drive may rename directories and split the download into multiple parts. Please make sure the database is assembled as shown below.

CreditCardDefault_Database
│
└─── csv
│   │   train_data.csv
│   │   train_labels.csv
└─── parquet
│   │   data_lastPerCustomerID.parquet
│   │   train_data.parquet
└─── results
│   │   NN_tuningRuns

💾 Scripts

main.R

Primary file used to load all scripts.
See installation instructions above.

Feature Engineering

Feature engineering functions are stored within 3 files:

  • database.R
    • Manages reading / writing to disk
  • cleansing.R
    • Functions to determine NA / Correlation / Covariance thresholds
  • Noise.R
    • Functions to remove injected noise in features.
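
To illustrate what threshold-based cleansing of this kind can look like, here is a minimal base-R sketch. It is not the code in cleansing.R: the function names, the default thresholds and the toy data frame are all assumptions made for illustration.

```r
# Illustrative helpers only; names and default thresholds are assumptions.

# Drop columns whose share of missing values exceeds `max_na`.
drop_sparse_columns <- function(df, max_na = 0.5) {
  na_share <- colMeans(is.na(df))
  df[, na_share <= max_na, drop = FALSE]
}

# List numeric column pairs whose absolute correlation exceeds `max_cor`.
correlated_pairs <- function(df, max_cor = 0.9) {
  num <- df[, vapply(df, is.numeric, logical(1)), drop = FALSE]
  cm <- abs(cor(num, use = "pairwise.complete.obs"))
  cm[lower.tri(cm, diag = TRUE)] <- NA   # keep each pair once, skip the diagonal
  idx <- which(cm > max_cor, arr.ind = TRUE)
  data.frame(a = rownames(cm)[idx[, "row"]],
             b = colnames(cm)[idx[, "col"]],
             stringsAsFactors = FALSE)
}

# Toy data: z is mostly missing, y is almost a multiple of x.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 6, 8, 10.5),
  z = c(NA, NA, NA, NA, 1),
  w = c(5, 1, 4, 2, 3)
)
kept  <- drop_sparse_columns(toy, max_na = 0.5)
pairs <- correlated_pairs(kept, max_cor = 0.9)
```

In this toy case `kept` retains x, y and w, and `pairs` flags x and y as a highly correlated pair, one of which could then be dropped.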


Machine Learning Models

Four models are evaluated as part of the project:

  • Logistic Regression
    • rf_logreg.R
  • Random Forest
    • rf_logreg.R
  • Logistic Regression (subset on the single P_2 feature)
    • logisticP2.R
  • Neural Network
    • NeuralNetwork.R
      • Wrapper to run neural network
    • DNN.R
      • Neural Network Model
    • NN_tuningResults.R
      • Functions to analyse neural network tuning results, and train network on best parameters
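
As a rough picture of the simplest model in the list, the single-feature logistic regression, the sketch below fits `glm` on one predictor. It is not the code in logisticP2.R: the data is synthetic, the label column name `target` is an assumption, and the relationship between P_2 and default is invented purely so the example runs.

```r
set.seed(42)
n <- 500

# Synthetic stand-in for the real data (invented for illustration only).
P_2 <- runif(n)
target <- rbinom(n, size = 1, prob = plogis(2 - 5 * P_2))
toy <- data.frame(P_2 = P_2, target = target)

# Logistic regression on the single P_2 feature.
fit <- glm(target ~ P_2, data = toy, family = binomial())

# Predicted probability of default for each row.
prob_default <- predict(fit, newdata = toy, type = "response")
```

A baseline like this is cheap to fit, which makes it a useful reference point when judging whether the larger models justify their run time.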

Auxiliary Scripts

  • amex_metric.R
    • Function to compute competition metric (Normalized Gini Coefficient)
  • EDAPlots.R
    • Functions to produce exploratory data analysis plots
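
For readers unfamiliar with the normalized Gini coefficient, a minimal unweighted sketch follows. It is an assumption about the general technique rather than a copy of amex_metric.R, which may weight observations or combine further terms; the function names are hypothetical.

```r
# Unweighted Gini: area between the cumulative share of positives
# (rows sorted by descending score) and the diagonal.
gini <- function(actual, score) {
  ord <- order(score, decreasing = TRUE)
  a <- actual[ord]
  n <- length(a)
  lorenz <- cumsum(a) / sum(a)
  sum(lorenz - seq_len(n) / n) / n
}

# Normalise by the Gini of a perfect ranking, so a perfect model scores 1.
normalized_gini <- function(actual, score) {
  gini(actual, score) / gini(actual, actual)
}

labels        <- c(1, 1, 0, 0)
perfect_score <- c(0.9, 0.8, 0.2, 0.1)
worst_score   <- c(0.1, 0.2, 0.8, 0.9)

g_perfect <- normalized_gini(labels, perfect_score)
g_worst   <- normalized_gini(labels, worst_score)
```

A perfect ranking gives 1 and a fully reversed ranking gives -1. When the scores contain no ties, this quantity equals 2 × AUC − 1.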

🚦 Run Times

Because of the large amount of data involved (even after feature engineering), the models take a significant length of time to run.
Tuned models are available for inspection in the Google Drive under results.

| Model | Script | Tuning Length |
|---|---|---|
| Logistic Regression (P2) | logisticP2.R | 🟩 |
| Logistic Regression | rf_logreg.R | 🟨 |
| Neural Network | NeuralNetwork.R | 🟥 |
| Random Forest | rf_logreg.R | 🟥 |

👬 Contributors

Sidney Harshman-Earley sidney.harshman-earley@ucdconnect.ie
Denis O’Riordan denis.oriordan1@ucdconnect.ie