This code helps to land the third place in Galaxy Zoo challenge. The convolutional part uses cuda-convnet library of Alex Krizhevsky. Please see the package first. The code was run on Windows 64bit machines. Please make sure python, numpy, scipy, PIL-image and matlab are available.
There are several main steps:
1. Split randomly the training data into model development (98%) and validation subsets (2%)
2. Train ConvNet on model development data, and extract features from the last hidden layer for all data
3. Train multiple neural networks using a small portion of development data and about half of validation data
4. Blend multiple neural networks using another neural network
5. Average several models from steps 2, 3 and 4. Models chosen are based on their performance on the public leaderboard.
The details are described as below:
Download images_training_rev1.zip, images_test_rev1.zip, training_solutions_rev1.zip, and central_pixel_benchmark.zip from the data page. Extract all archives, rename central_pixel_benchmark.csv to kaggle_submission.csv and store them in the ./raw_data/
directory in the following way:
raw_data/
images_training_rev1/
100008.jpg
...
999967.jpg
images_test_rev1/
100018.jpg
...
999996.jpg
training_solutions_rev1.csv
kaggle_submission.csv
From root directory, run the following command in cmd
:
python.exe make_batches.py 1 0 0 0
This produces 140 batches data (1-59: development; 60-61: validation; 62-140: testing):
RUN/
data/
data_batch_1
...
data_batch_140
batches.meta
Start training the network:
python.exe ./cuda_convnet/convnet.py --data-path=./RUN/data/ --save-path=./RUN/model/ --test-range=60-61 --train-range=1-59 --layer-def=./model_config/gz.cfg --layer-params=./model_config/gz_param.cfg --data-provider=kaggle-galaxy-zoo-128-cropped-x90rot-zoom-memory --test-freq=590 --test-one=0 --crop-border=4 --epochs=50 --max-filesize=100000
Go to the model directory ./RUN/model/
and grasp the directory name, e.g., ConvNet__2014-03-29_07.47.35, which contains saved checkpoints of the model. Then keep training to 140 epochs:
python.exe ./cuda_convnet/convnet.py -f ./RUN/model/ConvNet__2014-03-29_07.47.35 --epochs=140
Decay learning rates by 0.1 and run up to 166 epochs
python.exe ./cuda_convnet/convnet.py -f ./RUN/model/ConvNet__2014-03-29_07.47.35 --layer-params=./model_config/gz_param_1.cfg --test-freq=118 --epochs=166
Decay learning rates by 0.1 and run up to 174 epochs
python.exe ./cuda_convnet/convnet.py -f ./RUN/model/ConvNet__2014-03-29_07.47.35 --layer-params=./model_config/gz_param_2.cfg --epochs=174
Decay learning rates by 0.1 and run up to 182 epochs
python.exe ./cuda_convnet/convnet.py -f ./RUN/model/ConvNet__2014-03-29_07.47.35 --layer-params=./model_config/gz_param_3.cfg --epochs=182
The training phase will take about 31.5 hours on NVIDIA Tesla M2070 GPU.
Write predicted values for testing data (batch 62-140):
python.exe ./cuda_convnet/shownet.py -f ./RUN/model/ConvNet__2014-03-29_07.47.35/182.59 --data-path=./RUN/data/ --write-features=fc37 --feature-path=./RUN/model/ConvNet__2014-03-29_07.47.35_182.59/feat_62_140_fc37 --train-range=61 --test-range=62-140 --multiview-test=1
Extract features for the next step:
python.exe ./cuda_convnet/shownet.py -f ./RUN/model/ConvNet__2014-03-29_07.47.35/182.59 --data-path=./RUN/data/ --write-features=dropout2 --feature-path=./RUN/model/ConvNet__2014-03-29_07.47.35_182.59/feat_1_140_fc2048 --test-range=1-140 --multiview-test=1
python.exe ./write_mat.py ./RUN/model/ConvNet__2014-03-29_07.47.35_182.59/feat_1_140_fc2048 ./refine/convnet_features.mat
Write CSV result file to submit to kaggle.
python.exe ./write_result.py 62 140
The submission file - 0.07939.csv - is saved into: ./RUN/model/ConvNet__2014-03-29_07.47.35_182.59/feat_62_140_fc37/0.07939.csv
.
This step and the next are run in matlab (tested on version 2013)
3a. Load the features generated from the ConvNet:
load('./refine/convnet_features.mat')
3b. Run refine/refine.m
several times to create multiple models, see refine/models.txt
for details
Run refine/blend.m
to blend multiple models learned in step 3b
Finally, average 4 results in directory ./RUN/avg_res/
to have the final result for submission ./RUN/avg_res/final_submission.csv
:
python.exe avg_result.py
The final submission achieved 0.07869 on the private scoreboard.