Winning solution to the Kaggle competition - How Much Did It Rain? II

Primary language: Python. License: MIT.

How Much Did It Rain? II

Kaggle competition winning solution

This document describes how to generate the winning solution to the Kaggle competition How Much Did It Rain? II.

Further documentation on the method can be found in this blog post.

Generating the solution

Install the dependencies

The models are written in Python 2.7 and make use of the NumPy, scikit-learn, and pandas packages. These can be installed individually via pip or all together in a free Python distribution such as Anaconda.

Theano can be installed and configured to use any available NVIDIA GPUs by following the instructions here and here. The Lasagne package often requires the latest version of Theano; a simple pip install Theano may give a version that is out of date (see the Lasagne documentation for details).

Lasagne can be installed by following the instructions here.

Download the code

To download the code run:

git clone https://github.com/simaaron/kaggle-Rain.git

Create an empty data folder

cd kaggle-Rain
mkdir data

Download the training and test data

The training and test data can be downloaded from the Kaggle competition webpage at this link. The two extracted files train.csv and test.csv should be placed in the data folder.

Note: the benchmark sample solution and code provided by Kaggle are not required.

Preprocess the data

Replace the NaN entries with zeros (training and test data) and remove the outliers (training data only) by running:

python data_preprocessing.py

This also creates three additional folders: train, valid, and test. The size of the validation holdout subset and the outlier threshold on the expected rainfall value can be changed in the above Python script.
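As a rough illustration of the three preprocessing steps described above, here is a minimal pandas sketch. It is not the code in data_preprocessing.py: the function name preprocess, the default threshold of 70.0, the holdout fraction of 0.2, and the assumption that the target column is called Expected and the sequence key is called Id are all illustrative choices.

```python
import numpy as np
import pandas as pd


def preprocess(train, outlier_threshold=70.0, valid_fraction=0.2, seed=0):
    """Drop outlier targets, fill NaNs with zeros, and hold out a validation subset.

    Hypothetical helper; the column names "Id" and "Expected" and the
    default values are assumptions, not taken from the repository.
    """
    # Remove rows whose target exceeds the threshold (training data only).
    train = train[train["Expected"] <= outlier_threshold]
    # Replace missing readings with zeros.
    train = train.fillna(0.0)
    # Split on Id so that all rows of one sequence stay in the same subset.
    ids = np.random.RandomState(seed).permutation(train["Id"].unique())
    n_valid = int(round(valid_fraction * len(ids)))
    is_valid = train["Id"].isin(ids[:n_valid])
    return train[~is_valid], train[is_valid]
```

For the test data only the fillna step would apply, since the outlier filter and the holdout split both depend on the target.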

Augment the data sets with dropin copies

Create random augmentation copies of the datasets by running:

python data_augmentation_train.py
python data_augmentation_valid.py
python data_augmentation_test.py

This creates 61 randomly augmented copies of the preprocessed training and test data sets and one of the validation holdout set. Note that each copy is > 2GB in size. If disk space is insufficient, modify the training script NNregression_*.py and the test script NNprediction_*.py to perform these augmentations dynamically.

The number of copies can be changed in the above scripts.
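This document does not define the dropin augmentation, so the following is only a sketch under one assumption: that a dropin copy lengthens a variable-length sequence of readings to a fixed length by repeating randomly chosen time steps while keeping them in chronological order. The function name dropin and its parameters are hypothetical, and the augmentation scripts in the repository may work differently.

```python
import numpy as np


def dropin(sequence, target_length, rng):
    """Lengthen a (timesteps, features) array to target_length rows.

    Assumed interpretation of "dropin": randomly chosen time steps are
    repeated, and the original time order is preserved.
    """
    n_steps = sequence.shape[0]
    if n_steps == 0 or n_steps > target_length:
        raise ValueError("sequence must have between 1 and target_length rows")
    # Choose, with replacement, which original time steps get an extra copy.
    extra = rng.randint(0, n_steps, size=target_length - n_steps)
    # Sorting the combined row indices keeps the readings in time order.
    indices = np.sort(np.concatenate([np.arange(n_steps), extra]))
    return sequence[indices]
```

Calling such a function with a different random state for each copy would give distinct augmented versions of the same data, which is also the piece one would move into the training and prediction scripts to augment dynamically instead of storing every copy on disk.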

Train the networks

The two best models can be trained by running:

python NNregression_v1.py -v=1
python NNregression_v2.py -v=2

The functions corresponding to the different models can be found in the Python script NN_architectures.py. To train one of the remaining models, change the function import and call in either script above, save the result as a new script, and run it:

python NNregression_v*.py -v=*

Each model continually saves its outputs to its own output folder. These include the files training_scores.txt and validation_scores.txt which, for monitoring purposes, record the evolution of the training and validation errors respectively. The file model.npz holds the current best-fitting set of model parameters (with respect to the validation holdout set), and the file last_learn_rate.txt records the current (decayed) learning rate.
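The idea of keeping only the parameters that score best on the validation holdout set can be sketched with plain NumPy. This is an illustration rather than the training scripts' actual code: the helper names maybe_checkpoint and load_params are hypothetical, and the sketch assumes that the parameters are a list of arrays and that a lower validation score is better.

```python
import os

import numpy as np


def maybe_checkpoint(params, valid_score, best_score, folder):
    """Save params to model.npz if valid_score beats best_score.

    Hypothetical helper; returns the (possibly updated) best score.
    """
    if valid_score < best_score:
        # np.savez stores positional arrays under the keys arr_0, arr_1, ...
        np.savez(os.path.join(folder, "model.npz"), *params)
        return valid_score
    return best_score


def load_params(folder):
    """Read the arrays back from model.npz in their original order."""
    with np.load(os.path.join(folder, "model.npz")) as data:
        return [data["arr_%d" % i] for i in range(len(data.files))]
```

Starting from best_score = float("inf") and calling maybe_checkpoint after each validation pass leaves model.npz holding the best parameters seen so far, which matches the behaviour described above.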

Generate predictions from augmented test sets

The set of 61 augmented test set predictions from the model 'v1' can be obtained by running:

for j in `seq 0 60`;
do
	python NNpredictor_v1.py -rd=$j
done

The predictions from the pre-trained model included in the code download can be obtained by running:

for j in `seq 0 60`;
do
	python NNpredictor_v1.py -rd=$j -i pretrained_model_v1.npz
done

Average the augmented predictions

The predictions from different augmented copies can be combined by running:

python ensembling.py -v=1 -nr=61

This averages the 61 predictions of the model 'v1' and saves the result to the file ens_submission_v1_61ave_mean.csv.
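The averaging step itself is simple enough to sketch in a few lines of pandas. This is not the code in ensembling.py: the function name average_submissions is hypothetical, and the sketch assumes each prediction file has an Id column and an Expected column.

```python
import pandas as pd


def average_submissions(frames):
    """Average the Expected column over several prediction tables, per Id.

    Hypothetical helper; assumes every frame has "Id" and "Expected" columns.
    """
    combined = pd.concat(frames, ignore_index=True)
    # Grouping by Id makes the result independent of row order in each file.
    return combined.groupby("Id", as_index=False)["Expected"].mean()
```

In practice each frame would come from pd.read_csv on one prediction file, and the result would be written with to_csv(..., index=False). The same function covers a straight average of two models' ensembled predictions.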

On their own, the predictions from either of the models 'v1' and 'v2' would place 2nd or 3rd in the competition. A straight average of the two solutions would be sufficient for 1st place.