kaggle-2014-criteo: A C++ repository from VinGorilla

3 Idiots' Approach for Display Advertising Challenge
====================================================

This README introduces how to run our code up. For the introduction to our
approach, please see 

    http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf  


IMPORTANT
=========
We keep improving the qualities of our codes and documents. If the only thing
you want to do is to reproduce our submission, please download v1.0 from:

    https://github.com/guestwalk/kaggle-2014-criteo/releases

We tested this commit for several times. Except this commit, we do not
guarantee the submission can be reproduced.


System Requirement
==================

- 64-bit Unix-like operating system (We tested our code on Ubuntu 13.10)

- Python3

- g++ (with C++11 and OpenMP support)

- at least 20GB memory and 50GB disk space


Get The Dataset
===============

Criteo released the data set after the end of the competition. However, the
format of the released data set is different from the one used in the
competition; the format used in the comptition is in csv format, while the
released one is in text format.

Our scripts require csv format. If you already have the dataset used in the
competition, please skip to the next section. Otherwise, we provide a script to
convert the files in text format to csv format. Please follow these steps:

1. Download the data set from Criteo.
    
    http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset

2. Decompress and make sure the files are correct.
    
    $ md5sum dac.tar.gz
    df9b1b3766d9ff91d5ca3eb3d23bed27  dac.tar.gz

    $ tar -xzf dac.tar.gz

    $ md5sum train.txt test.txt
    4dcfe6c4b7783585d4ae3c714994f26a  train.txt
    94ccf2787a67fd3d6e78a62129af0ed9  test.txt


3. Use ``txt2csv.py'' to convert training and test data to csv format.
    
    $ converters/txt2csv.py tr train.txt train_.csv

    $ converters/txt2csv.py te test.txt test_.csv

4. Calculate the checksum of produced file.
    
    $ md5sum train_.csv test_.csv
    ebf87fe3daa7d729e5c9302947050f41  train_.csv
    8016f59e45abb37ae7f6e7956f30e052  test_.csv

Now you can follow the steps in ``Step-by-step'' section to run our experiments.


Step-by-step
============

This section is divided into two parts. The first part introduces how to run
our code with a toy example. After you finish this part, please follow the
second part to run with the whole dataset.


Tiny Example
------------

1. Create two symbolic links.

     $ ln -s train.tiny.csv train.csv
     $ ln -s test.tiny.csv test.csv

2. Compile excutables, prepare files, and scan necessary information from the
   training data.

     $ make

3. Generate the prediction file "submission.csv".

     $ ./run.py

4. Generate the checksum for submission.csv.

     $ md5sum submission.csv
     d29c2f9e846a0ea77a6f56e983316360  submission.csv

   If you see the above checksum, congratulations!! Please run the following
   command and move to the next sub-section.

     $ make clean
     $ rm -f train.csv test.csv


The Whole Dataset
-----------------

1. Copy or link the training and test data to this directory. Their name should 
   be "train.csv" and "test.csv", respectively. Do the checksum to make sure
   they are correct.

     $ md5sum train.csv test.csv
     ebf87fe3daa7d729e5c9302947050f41  train.csv
     8016f59e45abb37ae7f6e7956f30e052  test.csv

2. Compile excutables, prepare files, and scan necessary information from the
   training data. These step takes around an hour.
    
     $ make

3. Generate the prediction file "submission.csv". (You may want to use more
   threads. Please see "Miscellaneous 1".)

     $ run.py


Miscellaneous
=============

1. By default we use only one thread, so it may take a long time to train the
   model. If you have multi-core CPUs, you may want to set NR_THREAD in run.py
   to use all cores. On our machine with two six-core CPUs (intel E5-2620), it
   takes around 3.5 hours when all cores are used.

2. Our algorithms is non-deterministic when multiple threads are used. (That
   is, the results can be slightly different when you run the script two or
   more times.) In our experience, the variances generally do not exceed
   0.0001 (LogLoss).

3. This script generates a prediction with around 0.44490/0.44480 on 
   public/private leaderboards. If you want the prediction with around 
   0.44460/0.44450, please change the following setting in run.py.

     ./gbdt -t 30        --->         ./gbdt -t 50  

     ./fm -k 4 -t 11     --->         ./fm -k 8 -t 15

   Training with this setting takes around 8 hours on our machine.

4. If you have any question, please send your email to:

     guestwalk@gmail.com (Yu-Chin's email)
VinGorilla/kaggle-2014-criteo