The objective of this challenge is to predict which borrowers will experience financial distress in the next two years.
For this challenge, I covered the following steps:
- EDA
- Training an initial model (with default parameters)
- Assessing the model
- Submitting predictions from the initial model
- Optimizing the model to reach a higher score on the Kaggle leaderboard
- Submitting predictions from the optimized model
The repository structure is inspired by https://drivendata.github.io/cookiecutter-data-science/
/data:
    /output -> Predictions
    /raw -> Raw data (training and test datasets)
/models:
    *.pkl -> Pickled trained models / Hyperopt trials
/notebooks:
    EDA.ipynb -> Used for EDA
    assess.ipynb -> Used for assessing model performance
/src:
    train.py -> Train a model
    predict.py -> Generate predictions with a trained model
    optimize.py -> Run Hyperopt with TPE
Makefile -> Used to simplify pipeline execution
requirements.txt -> Packages to install
setup.py -> Used to install the current package
If you would like to get started ASAP, run these make commands in the following order:
make venv
-> Set up the Python virtual environment
make train_def
-> Train the initial model
make predict_def
-> Predict using the initial model
make optimize
-> Run the hyperparameter optimizer (Hyperopt)
make train_opt
-> Train the model with optimized parameters
make predict_opt
-> Predict using the optimized model
Run the following command:
make venv
This sets up the virtual environment and installs all packages required for this challenge.
EDA is available in the notebook notebooks/EDA.ipynb.
In this notebook, we check for missing data in the dataset and examine each feature's type and distribution. We also look at the correlation between the target class and the different features.
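To make those steps concrete, here is a minimal sketch of the kind of checks performed in the notebook. The file name and target column follow the Kaggle dataset; the exact notebook code may differ.

import pandas as pd

# Load the raw training data (file name assumed from the Kaggle dataset).
df = pd.read_csv("data/raw/cs-training.csv", index_col=0)

# Missing data: absolute count and ratio per feature.
missing = df.isna().sum().to_frame("n_missing")
missing["ratio"] = missing["n_missing"] / len(df)
print(missing.sort_values("ratio", ascending=False))

# Feature types and summary statistics of their distributions.
print(df.dtypes)
print(df.describe().T)

# Correlation between the target ("SeriousDlqin2yrs") and the other features.
print(df.corr()["SeriousDlqin2yrs"].sort_values(ascending=False))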
The first model is an XGBoost model.
It was chosen because it yields good results without requiring much data transformation (normalization, clipping, etc.). Its parameters can be found in params/def_xgb_model.json.
{
    "name": "def_xgb_model",
    "params": {
        "max_depth": 4,
        "n_estimators": 100,
        "learning_rate": 0.05,
        "n_jobs": -1,
        "objective": "binary:logistic",
        "colsample_bytree": 0.5,
        "gamma": 1
    }
}
We use 90% of the data for training and hold out the remaining 10% for validation, to check that the model doesn't overfit and generalizes well to new data.
Run either:
make train_def
Or:
python3 train.py --model_json ../params/def_xgb_model.json --split_ratio 0.9
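Under the hood, train.py plausibly does something like the sketch below. The JSON loading, split details, and pickling are assumptions based on the repo layout, not the actual implementation.

import json
import pickle

import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Load the model name and hyperparameters from the JSON config shown above.
with open("params/def_xgb_model.json") as f:
    config = json.load(f)

df = pd.read_csv("data/raw/cs-training.csv", index_col=0)
X = df.drop(columns="SeriousDlqin2yrs")
y = df["SeriousDlqin2yrs"]

# 90/10 train/validation split, mirroring --split_ratio 0.9.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.9, stratify=y, random_state=42
)

model = XGBClassifier(**config["params"])
model.fit(X_train, y_train)

# Persist the trained model under /models for prediction and assessment.
with open(f"models/{config['name']}.pkl", "wb") as f:
    pickle.dump(model, f)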
You can then use the model for prediction.
Run either:
make predict_def
Or:
python3 predict.py --model_json ../params/def_xgb_model.json
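Conceptually, prediction amounts to something like the following. This is a sketch only; the test file name and submission format are assumptions based on the Kaggle competition, not the actual predict.py.

import json
import pickle

import pandas as pd

with open("params/def_xgb_model.json") as f:
    config = json.load(f)

# Load the pickled model trained in the previous step.
with open(f"models/{config['name']}.pkl", "rb") as f:
    model = pickle.load(f)

# Load the test set (file name assumed); its target column is empty.
test = pd.read_csv("data/raw/cs-test.csv", index_col=0)
X_test = test.drop(columns="SeriousDlqin2yrs")

# Kaggle scores the predicted probability of the positive class.
proba = model.predict_proba(X_test)[:, 1]

submission = pd.DataFrame({"Id": X_test.index, "Probability": proba})
submission.to_csv(f"data/output/{config['name']}_predictions.csv", index=False)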
The model is assessed in the notebook notebooks/assess.ipynb.
In this notebook, we investigate different plots and metrics to assess model performance. We also look at feature importance for model interpretability.
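The checks performed there look roughly like the sketch below, assuming the same 90/10 split is recreated; the notebook itself is the reference.

import pickle

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import RocCurveDisplay, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import plot_importance

# Rebuild the validation split used at training time (assumed seed).
df = pd.read_csv("data/raw/cs-training.csv", index_col=0)
X = df.drop(columns="SeriousDlqin2yrs")
y = df["SeriousDlqin2yrs"]
_, X_val, _, y_val = train_test_split(X, y, train_size=0.9, stratify=y, random_state=42)

with open("models/def_xgb_model.pkl", "rb") as f:
    model = pickle.load(f)

# AUC on the held-out validation set (the competition metric).
val_proba = model.predict_proba(X_val)[:, 1]
print("Validation AUC:", roc_auc_score(y_val, val_proba))

# ROC curve plus XGBoost feature importance for interpretability.
RocCurveDisplay.from_predictions(y_val, val_proba)
plot_importance(model, max_num_features=10)
plt.show()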
Private Score: 0.86645 (150th)
Public Score: 0.85998 (211th)
To reach a higher score, I considered two options:
- Hyperparameter tuning
- Ensemble / Stacking ensemble
Because of time constraints, I opted for the first option.
I used the hyperopt package to find a set of XGBoost parameters such that the model's mean validation AUC over K folds is maximized.
The TPE algorithm is used to search the parameter space.
The search history is dumped to models/opt_trials.pkl after every round, so that the optimizer can be stopped and resumed at any time.
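For illustration, the core of optimize.py might look like the sketch below. The search space, the 5-fold CV, and the one-round-at-a-time loop are assumptions; only the TPE search and the trials pickle come from the description above.

import os
import pickle

import pandas as pd
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

TRIALS_PATH = "models/opt_trials.pkl"

df = pd.read_csv("data/raw/cs-training.csv", index_col=0)
X = df.drop(columns="SeriousDlqin2yrs")
y = df["SeriousDlqin2yrs"]

# Illustrative search space; the real one may cover more parameters.
space = {
    "max_depth": hp.choice("max_depth", range(3, 10)),
    "eta": hp.uniform("eta", 0.01, 0.3),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
    "colsample_bytree": hp.uniform("colsample_bytree", 0.5, 1.0),
    "min_child_weight": hp.quniform("min_child_weight", 1, 10, 1),
    "gamma": hp.uniform("gamma", 0, 5),
    "n_estimators": hp.choice("n_estimators", range(100, 500, 10)),
}

def objective(params):
    model = XGBClassifier(objective="binary:logistic", n_jobs=-1, **params)
    # Mean validation AUC over 5 folds; hyperopt minimizes, so negate it.
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    return {"loss": -auc, "status": STATUS_OK}

# Resume from a previous run if a trials file already exists.
if os.path.exists(TRIALS_PATH):
    with open(TRIALS_PATH, "rb") as f:
        trials = pickle.load(f)
else:
    trials = Trials()

# Run one round at a time so the trials object can be dumped after each.
for n in range(len(trials.trials) + 1, 81):
    fmin(objective, space, algo=tpe.suggest, max_evals=n, trials=trials)
    with open(TRIALS_PATH, "wb") as f:
        pickle.dump(trials, f)

print(trials.best_trial["result"])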
Run either:
make optimize
Or:
python3 optimize.py
The optimizer may take a while to run.
The best set of parameters found after 80 optimization rounds is the following:
{
    "name": "opt_xgb_model",
    "params": {
        "booster": "gbtree",
        "colsample_bytree": 0.9,
        "eta": 0.034,
        "gamma": 0.5,
        "max_depth": 4,
        "min_child_weight": 6.0,
        "n_estimators": 290,
        "subsample": 0.56
    }
}
Private Score: 0.86817 (45th)
Public Score: 0.86169 (80th)