A blending approach combining gradient boosting (XGBoost) and artificial neural networks (Keras) for the Kaggle contest "Porto Seguro's Safe Driver Prediction".
Chen-Hsi (Sky) Huang (https://github.com/skyhuang1208)
Louis Yang (https://github.com/louis925)
Luyao Zoe Xu (https://github.com/LuyaoXu)
Ming-Chang Chiu (https://github.com/charismaticchiu)
Silver medal, top 4% (164th out of 5169 teams)
- Train several XGBoost and neural network models. In the training stages, some or all of the following techniques were used (a preprocessing sketch follows this list):
- Log1p transform on right-skewed features and cubic transform on left-skewed features
- Stratified K-fold cross validation
- Grid search for hyper-parameters
- Upsampling to enhance the rare positive cases
- Embedding neural network
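A minimal sketch of these preprocessing steps, assuming illustrative column choices and a 3x upsampling ratio (the columns actually transformed and the exact ratio used in the repository may differ):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

train = pd.read_csv('input/train.csv')
X = train.drop(['id', 'target'], axis=1)
y = train['target']

# Skew-reducing transforms (column choices here are illustrative):
# log1p compresses a long right tail; cubing reduces left skew.
X['ps_car_13'] = np.log1p(X['ps_car_13'])   # right-skewed feature -> log1p
X['ps_car_15'] = X['ps_car_15'] ** 3        # left-skewed feature  -> cubic

# Stratified K-fold keeps the rare positive class equally represented in each fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for tr_idx, val_idx in skf.split(X, y):
    y_tr = y.iloc[tr_idx]
    X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]

    # Upsample the rare positive cases in the training fold only (here 3x).
    pos_idx = tr_idx[y_tr.values == 1]
    up_idx = np.concatenate([tr_idx, np.repeat(pos_idx, 2)])
    np.random.shuffle(up_idx)
    X_up, y_up = X.iloc[up_idx], y.iloc[up_idx]
    # ...fit an XGBoost or Keras model on (X_up, y_up) and evaluate on (X_val, y_val)
```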
- Blending - Linear combination of all predictions from different models (LCM)
- Cross validation: determine combination weights
- Coarse grid search followed by Monte Carlo fine search
- Probability vs rank blending: combine the results using either the predicted probabilities or the rankings of the probabilities (a weight-search sketch follows this list)
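A rough sketch of this weight search, assuming a simple in-memory setup (the actual search scripts listed below work on prediction files; the helper names and the grid/Monte Carlo parameters here are illustrative):

```python
import numpy as np
from itertools import product
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

def gini_normalized(y_true, y_score):
    # For a binary target, the normalized Gini coefficient equals 2*AUC - 1.
    return 2.0 * roc_auc_score(y_true, y_score) - 1.0

def blend(preds, weights, use_rank=False):
    # preds: list of 1-D arrays of validation predictions, one per model.
    # Rank blending replaces each model's probabilities by their scaled ranks
    # before averaging, which makes the blend insensitive to calibration.
    cols = [rankdata(p) / len(p) if use_rank else np.asarray(p) for p in preds]
    w = np.asarray(weights, dtype=float)
    return np.dot(w / w.sum(), np.vstack(cols))

def search_weights(y_val, preds, use_rank=False, grid_points=5, n_mc=2000, sigma=0.05, seed=0):
    rng = np.random.default_rng(seed)
    best_w, best_score = None, -np.inf
    # Coarse grid search over weight combinations.
    for w in product(np.linspace(0.0, 1.0, grid_points), repeat=len(preds)):
        if sum(w) == 0:
            continue
        score = gini_normalized(y_val, blend(preds, w, use_rank))
        if score > best_score:
            best_w, best_score = np.asarray(w) / sum(w), score
    # Monte Carlo fine search: random perturbations around the best grid point.
    for _ in range(n_mc):
        w = np.clip(best_w + rng.normal(0.0, sigma, len(preds)), 0.0, None)
        if w.sum() == 0:
            continue
        score = gini_normalized(y_val, blend(preds, w, use_rank))
        if score > best_score:
            best_w, best_score = w / w.sum(), score
    return best_w, best_score
```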
- `blend_prob_search.py`: input validation predictions, search for the best weights w.r.t. Gini coefficient or log loss
- `blend_prob_combine.py`: input predictions on the test set and weights, output the final submission
- `blend_rank_search.py`: rank-combine version of `blend_prob_search`
- `blend_rank_combine.py`: rank-combine version of `blend_prob_combine`
- `model_xgboost_sky.py`: XGBoost model 1
- `model_xgboost_luyao.ipynb`: XGBoost model 2
- `model_nn_keras.ipynb`: Keras neural network model (an embedding-network sketch follows this list)
- `model_nn_tf.ipynb`: TensorFlow neural network model (in development)
- `model_random_forest.py`: sklearn random forest model
- `split_train_val.py`: split the training dataset into `LCM_train` and `LCM_val` for blending weight search
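A minimal sketch of the embedding neural network idea behind `model_nn_keras.ipynb`, assuming a generic split into categorical and numeric features (layer sizes, embedding dimensions, and the optimizer are illustrative, not the exact architecture in the notebook):

```python
from tensorflow.keras.layers import Input, Embedding, Flatten, Concatenate, Dense, Dropout
from tensorflow.keras.models import Model

def build_embedding_nn(cat_cardinalities, n_numeric):
    """cat_cardinalities: number of distinct levels of each categorical feature."""
    cat_inputs, cat_embedded = [], []
    for i, n_levels in enumerate(cat_cardinalities):
        inp = Input(shape=(1,), name=f'cat_{i}')
        # One small dense embedding per categorical feature (size is a heuristic).
        emb = Embedding(input_dim=n_levels, output_dim=min(10, (n_levels + 1) // 2))(inp)
        cat_inputs.append(inp)
        cat_embedded.append(Flatten()(emb))

    num_input = Input(shape=(n_numeric,), name='numeric')
    x = Concatenate()(cat_embedded + [num_input])
    x = Dense(64, activation='relu')(x)
    x = Dropout(0.3)(x)
    x = Dense(32, activation='relu')(x)
    out = Dense(1, activation='sigmoid')(x)

    model = Model(inputs=cat_inputs + [num_input], outputs=out)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

# Usage: one integer-encoded array per categorical column plus one numeric matrix, e.g.
# model = build_embedding_nn([3, 5, 104], n_numeric=30)
# model.fit([cat0, cat1, cat2, numeric], y, epochs=10, batch_size=2048)
```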
- Download and unzip `train.7z` and `test.7z` from Kaggle (https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data), and place them in the `input` folder.
- Run `split_train_val.py` to split the training dataset `train.csv` into 3 sets of `LCM_train*.csv` and `LCM_val*.csv` for blending weight search.
- Train each model on `LCM_train*.csv` and predict on the corresponding `LCM_val*.csv` for each LCM set separately. These results are used for determining the blending weights.
- Train each model on `train.csv` and predict on `test.csv`. These results will be blended together as the final prediction.
- Run `blend_prob_search.py` (or `blend_rank_search.py`) to find the best weights using the results on the `LCM_val*.csv` sets from each model.
- Run `blend_prob_combine.py` (or `blend_rank_combine.py`) to combine the predictions on `test.csv` from each model with the best weights found in the previous step (a minimal sketch of this final combining step follows these instructions).