Top-4% solution to the Home Credit Default Risk Kaggle competition on credit scoring.
In finance, credit scoring refers to the use of statistical models to guide loan approval decisions. This project develops a binary classification model that distinguishes defaulters from non-defaulters using supervised machine learning.
The project works with data from multiple sources, including credit bureau information, loan application data, performance on previous loans and credit card balances. I perform thorough feature engineering and aggregate the data into a single high-dimensional data set. Next, I train an ensemble of LightGBM models that predict the probability of default.
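As a rough illustration of the aggregation step (not the exact notebook code), the secondary tables are summarized per applicant and merged onto the main application table. The sketch below assumes the standard Kaggle file and column names (`application_train.csv`, `bureau.csv`, `SK_ID_CURR`, etc.) and pandas groupby aggregations; the actual features in `code_1_data_prep.ipynb` are more extensive.

```python
# Minimal sketch: aggregate one secondary source (bureau.csv) per applicant
# and merge the statistics onto the application table.
import pandas as pd

apps = pd.read_csv("data/application_train.csv")
bureau = pd.read_csv("data/bureau.csv")

# Summary statistics of bureau records per applicant (keyed by SK_ID_CURR)
bureau_agg = bureau.groupby("SK_ID_CURR").agg(
    bureau_credit_count=("SK_ID_BUREAU", "count"),
    bureau_debt_mean=("AMT_CREDIT_SUM_DEBT", "mean"),
    bureau_overdue_max=("CREDIT_DAY_OVERDUE", "max"),
).reset_index()

# Left-join onto the applications to build one wide feature matrix
features = apps.merge(bureau_agg, on="SK_ID_CURR", how="left")
```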
The project has the following structure:
codes/
: Python notebooks with the code for data preparation, modeling and ensembling
data/
: input data (not included due to size constraints; it can be downloaded here)
output/
: output figures exported from the notebooks
solutions/
: slides with competition solutions from other competitors
submissions/
: test set predictions produced by the trained models
There are three notebooks:
code_1_data_prep.ipynb
: processing of the raw data, feature engineering and export of the aggregated data set
code_2_modeling.ipynb
: training LightGBM models to predict credit risk and exporting the test set predictions
code_3_ensemble.ipynb
: ensembling predictions from the different LightGBM models trained in code_2_modeling.ipynb
More details are provided in the documentation within the notebooks.
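To give a flavor of the modeling and ensembling steps, the sketch below shows stratified K-fold training of a LightGBM classifier that produces out-of-fold and test set predictions, followed by a simple average over several runs. This is an illustrative assumption, not the exact notebook code, and it presumes a recent LightGBM version with callback-based early stopping; the feature matrix `X`, target `y` and `X_test` are assumed to come from the data preparation step.

```python
# Hypothetical sketch of cross-validated LightGBM training and ensembling.
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def train_lgb(X, y, X_test, params, n_splits=5, seed=0):
    """Train one K-fold LightGBM run; return averaged test predictions."""
    oof = np.zeros(len(X))
    test_pred = np.zeros(len(X_test))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in folds.split(X, y):
        model = lgb.LGBMClassifier(**params)
        model.fit(
            X.iloc[train_idx], y.iloc[train_idx],
            eval_set=[(X.iloc[valid_idx], y.iloc[valid_idx])],
            callbacks=[lgb.early_stopping(100, verbose=False)],
        )
        # Out-of-fold predictions for validation, averaged test predictions
        oof[valid_idx] = model.predict_proba(X.iloc[valid_idx])[:, 1]
        test_pred += model.predict_proba(X_test)[:, 1] / n_splits
    print("CV AUC:", roc_auc_score(y, oof))
    return test_pred

# Ensemble: average predictions from runs with different seeds or parameters
# preds = [train_lgb(X, y, X_test, params, seed=s) for s in range(3)]
# submission["TARGET"] = np.mean(preds, axis=0)
```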
To run the project code, you can create a new conda virtual environment:
conda create -n py3 python=3.7
conda activate py3
and then install the requirements:
conda install -n py3 --yes --file requirements.txt
pip install lightgbm
pip install imblearn