A blending approach combining gradient boosting (XGBoost) and artificial neural networks (Keras) for the Kaggle contest "Porto Seguro's Safe Driver Prediction".
Chen-Hsi (Sky) Huang (https://github.com/skyhuang1208)
Louis Yang (https://github.com/louis925)
Luyao Zoe Xu (https://github.com/LuyaoXu)
Ming-Chang Chiu (https://github.com/charismaticchiu)
Silver medal, top 4% (164th out of 5169 teams)
- Train several XGBoost and neural network models. In the training stages, some or all of the following techniques were used (a preprocessing sketch follows this list):
- Log1p transform on right-skewed features and cubic transform on left-skewed features
- Stratified K-fold cross validation
- Grid search for hyper-parameters
- Upsampling to enhance the rare positive cases
- Embedding neural network
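A minimal sketch of these preprocessing steps, assuming illustrative column choices and a 3x upsampling ratio (the columns actually transformed and the exact ratio used in the repository may differ):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

train = pd.read_csv('input/train.csv')
X = train.drop(['id', 'target'], axis=1)
y = train['target']

# Skew-reducing transforms (column choices here are illustrative):
# log1p compresses a long right tail; cubing reduces left skew.
X['ps_car_13'] = np.log1p(X['ps_car_13'])   # right-skewed feature -> log1p
X['ps_car_15'] = X['ps_car_15'] ** 3        # left-skewed feature  -> cubic

# Stratified K-fold keeps the rare positive class equally represented in each fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for tr_idx, val_idx in skf.split(X, y):
    y_tr = y.iloc[tr_idx]
    X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]

    # Upsample the rare positive cases in the training fold only (here 3x).
    pos_idx = tr_idx[y_tr.values == 1]
    up_idx = np.concatenate([tr_idx, np.repeat(pos_idx, 2)])
    np.random.shuffle(up_idx)
    X_up, y_up = X.iloc[up_idx], y.iloc[up_idx]
    # ...fit an XGBoost or Keras model on (X_up, y_up) and evaluate on (X_val, y_val)
```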
- Blending - Linear combination of all predictions from different models (LCM)
- Cross validation: determine combination weights
- Coarse grid search followed by Monte Carlo fine search
- Probability vs rank blending: combine the results using either the predicted probabilities or the rankings of the probabilities (a weight-search sketch follows this list)
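A rough sketch of this weight search, assuming a simple in-memory setup (the actual search scripts listed below work on prediction files; the helper names and the grid/Monte Carlo parameters here are illustrative):

```python
import numpy as np
from itertools import product
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

def gini_normalized(y_true, y_score):
    # For a binary target, the normalized Gini coefficient equals 2*AUC - 1.
    return 2.0 * roc_auc_score(y_true, y_score) - 1.0

def blend(preds, weights, use_rank=False):
    # preds: list of 1-D arrays of validation predictions, one per model.
    # Rank blending replaces each model's probabilities by their scaled ranks
    # before averaging, which makes the blend insensitive to calibration.
    cols = [rankdata(p) / len(p) if use_rank else np.asarray(p) for p in preds]
    w = np.asarray(weights, dtype=float)
    return np.dot(w / w.sum(), np.vstack(cols))

def search_weights(y_val, preds, use_rank=False, grid_points=5, n_mc=2000, sigma=0.05, seed=0):
    rng = np.random.default_rng(seed)
    best_w, best_score = None, -np.inf
    # Coarse grid search over weight combinations.
    for w in product(np.linspace(0.0, 1.0, grid_points), repeat=len(preds)):
        if sum(w) == 0:
            continue
        score = gini_normalized(y_val, blend(preds, w, use_rank))
        if score > best_score:
            best_w, best_score = np.asarray(w) / sum(w), score
    # Monte Carlo fine search: random perturbations around the best grid point.
    for _ in range(n_mc):
        w = np.clip(best_w + rng.normal(0.0, sigma, len(preds)), 0.0, None)
        if w.sum() == 0:
            continue
        score = gini_normalized(y_val, blend(preds, w, use_rank))
        if score > best_score:
            best_w, best_score = w / w.sum(), score
    return best_w, best_score
```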
- `blend_prob_search.py`: input validation predictions, search for the best weights w.r.t. Gini coefficient or log loss
- `blend_prob_combine.py`: input predictions on the test set and weights, output the final submission
- `blend_rank_search.py`: rank-combine version of `blend_prob_search`
- `blend_rank_combine.py`: rank-combine version of `blend_prob_combine`
- `model_xgboost_sky.py`: XGBoost model 1
- `model_xgboost_luyao.ipynb`: XGBoost model 2
- `model_nn_keras.ipynb`: Keras neural network model (an embedding-network sketch follows this list)
- `model_nn_tf.ipynb`: TensorFlow neural network model (in development)
- `model_random_forest.py`: sklearn random forest model
- `split_train_val.py`: split the training dataset into `LCM_train` and `LCM_val` for blending weight search
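A minimal sketch of the embedding neural network idea behind `model_nn_keras.ipynb`, assuming a generic split into categorical and numeric features (layer sizes, embedding dimensions, and the optimizer are illustrative, not the exact architecture in the notebook):

```python
from tensorflow.keras.layers import Input, Embedding, Flatten, Concatenate, Dense, Dropout
from tensorflow.keras.models import Model

def build_embedding_nn(cat_cardinalities, n_numeric):
    """cat_cardinalities: number of distinct levels of each categorical feature."""
    cat_inputs, cat_embedded = [], []
    for i, n_levels in enumerate(cat_cardinalities):
        inp = Input(shape=(1,), name=f'cat_{i}')
        # One small dense embedding per categorical feature (size is a heuristic).
        emb = Embedding(input_dim=n_levels, output_dim=min(10, (n_levels + 1) // 2))(inp)
        cat_inputs.append(inp)
        cat_embedded.append(Flatten()(emb))

    num_input = Input(shape=(n_numeric,), name='numeric')
    x = Concatenate()(cat_embedded + [num_input])
    x = Dense(64, activation='relu')(x)
    x = Dropout(0.3)(x)
    x = Dense(32, activation='relu')(x)
    out = Dense(1, activation='sigmoid')(x)

    model = Model(inputs=cat_inputs + [num_input], outputs=out)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

# Usage: one integer-encoded array per categorical column plus one numeric matrix, e.g.
# model = build_embedding_nn([3, 5, 104], n_numeric=30)
# model.fit([cat0, cat1, cat2, numeric], y, epochs=10, batch_size=2048)
```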
- Download and unzip `train.7z` and `test.7z` from Kaggle (https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data), and place them in the `input` folder.
- Run `split_train_val.py` to split the training dataset `train.csv` into 3 sets of `LCM_train*.csv` and `LCM_val*.csv` for blending weight search.
- Train each model on `LCM_train*.csv` and predict on the corresponding `LCM_val*.csv` for each LCM set separately. These results are used for determining the blending weights.
- Train each model on `train.csv` and predict on `test.csv`. These results will be blended together as the final prediction.
- Run `blend_prob_search.py` (or `blend_rank_search.py`) to find the best weights using the results on the `LCM_val*.csv` sets from each model.
- Run `blend_prob_combine.py` (or `blend_rank_combine.py`) to combine the predictions on `test.csv` from each model with the best weights found in the previous step (a minimal sketch of this final combining step follows these instructions).