Script for Kaggle's Liberty Challenge contest.
Instructions for generating submission files:

- Download the .csv files `sampleSubmission.csv`, `test.csv` and `train.csv` into the `data` folder of this repository.
- Run the script `train.py`.
  - If the `build_features` flag is manually set to `False`, the script reads the `feats` dictionary from the `features` folder and will finish faster (a sketch of this caching logic follows the list).
  - Otherwise, the features are built up from scratch.
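The caching behaviour can be pictured with a minimal sketch. This is not the actual code of `train.py`; the cache path `features/feats.pkl`, the pickle format and the helper names are assumptions made for illustration.

```python
import os
import pickle

build_features = True  # set to False to reuse the cached feats dictionary

FEATS_PATH = os.path.join("features", "feats.pkl")  # assumed cache location


def build_feats_from_scratch():
    # Stand-in for the repository's actual feature construction.
    return {}


def get_feats():
    """Return the feats dictionary, cached or built from scratch."""
    if not build_features and os.path.exists(FEATS_PATH):
        with open(FEATS_PATH, "rb") as f:
            return pickle.load(f)  # fast path: reuse cached features
    feats = build_feats_from_scratch()  # slow path: rebuild everything
    os.makedirs("features", exist_ok=True)
    with open(FEATS_PATH, "wb") as f:
        pickle.dump(feats, f)  # cache for later runs
    return feats
```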
The `train.py` script will generate the submission files `gbm_sub.csv` and `mean_sub.csv` in the `submissions` folder. `gbm_sub.csv` scores 0.32160 on the private leaderboard and `mean_sub.csv` scores 0.32464 (i.e. 4th place for both).
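As a hedged illustration of the output step (not the script's actual code), a submission file can be written by reusing the layout of `sampleSubmission.csv`; the zero predictions below are placeholders, and the prediction column is assumed to be the last one in the sample file.

```python
import os

import numpy as np
import pandas as pd

# Reuse the id/prediction layout of the sample submission.
sub = pd.read_csv("data/sampleSubmission.csv")
preds = np.zeros(len(sub))    # placeholder for real model predictions
sub[sub.columns[-1]] = preds  # assumption: last column holds the prediction

os.makedirs("submissions", exist_ok=True)
sub.to_csv("submissions/gbm_sub.csv", index=False)
```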
Most of the computation time is spent training the gb-regressor:

- On an i7-4790, training the randomized lasso feature selector takes around 10 minutes, training the L1 feature selectors takes around 20 minutes, and training the gb-regressor takes about 70 minutes.
- The script uses a lot of RAM in preprocessing.
- During training of the randomized lasso feature selector, RAM usage also occasionally hits 100% on a 16 GB machine, along with periodic sharp rises in CPU temperature. This RAM usage can be lowered by reducing the `n_jobs` parameter of the randomized lasso routine (a sketch follows this list).
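A minimal sketch of that last point, using random stand-in data: the `n_jobs` argument of scikit-learn's `RandomizedLasso` controls how many resampled lasso fits run in parallel, and lowering it lowers peak RAM usage at the cost of wall-clock time. Note that `RandomizedLasso` was removed in scikit-learn 0.21, so this assumes the contest-era version of the library.

```python
import numpy as np
from sklearn.linear_model import RandomizedLasso  # removed in scikit-learn 0.21

rng = np.random.RandomState(0)
X = rng.randn(200, 50)  # stand-in design matrix
y = rng.randn(200)      # stand-in target

selector = RandomizedLasso(
    n_resampling=200,  # number of resampled lasso fits
    n_jobs=2,          # fewer parallel workers -> lower peak RAM usage
    random_state=0,
)
selector.fit(X, y)
kept = selector.get_support(indices=True)  # indices of the selected features
```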
It should be noted that these submissions are not the ones I submitted to the contest: I moved towards more complex models, which performed better on the public leaderboard but much worse on the private one. This model is still interesting because it did quite well on the private leaderboard despite being very simple; the 4th place finish can be achieved with a single gradient boosting regressor fitted on a selection of roughly 20 features, as sketched below.
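The shape of that final model can be sketched as follows; the random data, the selected feature indices and the default hyperparameters are placeholders rather than the values used by `train.py`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.randn(500, 100)   # stand-in for the preprocessed training data
y = rng.randn(500)        # stand-in for the target

selected = np.arange(20)  # placeholder for the ~20 selected feature indices
gbm = GradientBoostingRegressor(random_state=0)  # hyperparameters omitted
gbm.fit(X[:, selected], y)
preds = gbm.predict(X[:, selected])
```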