/Machine-March-Madness

Code for predicting the NCAA College Basketball 'March Madness' tournament

Primary LanguagePython

DEPENDENCIES:
theano: http://deeplearning.net/software/theano/
numpy
scipy
(optional) matplotlib: http://matplotlib.sourceforge.net/

Note: maccam912 reports that the link above is currently down (Feb 28, 2012) but says, "I see you can still get it with 'easy_install Theano' or downloading it from http://pypi.python.org/pypi/Theano or the repo on github. Just a heads up. I don't use git or github often so I wasn't sure if the issues section was the place to point this out or not."


DATA:
Right now, only the aggregate data is really being used.  To check that 
you can load the data properly, run

> python march_madness_data.py

this should output the following:
Skipped 1426 entries due to UNK
After loading simple data
2006-2007: 5125 games
2007-2008: 5248 games
2008-2009: 5332 games
2009-2010: 5363 games
After removing tournament games
2006-2007: 5047 games
2007-2008: 5161 games
2008-2009: 5237 games
2009-2010: 5269 games

So it loaded about 5000 games from each of 4 past seasons.  

If you'd like to dig into the full data, look for 

def load_full_data(self):

in march_madness_data.py.


BRACKET:
We don't have data that specifically identifies which games were a part 
of the tournament, so we do it programmatically.  Most of the code is
called automatically when you make a MarchMadnessData object.  To see the
results, you can run

> python bracket.py

This should output the filled-in tournament bracket for previous seasons.
It should look like this:

2008-2009
nav---nav---nav---nav---nav---nav
raa     |     |     |     |     |   
lav---lav     |     |     |     |   
bav           |     |     |     |   
              |     |     |     |   
gaj---gaj---gaj     |     |     |   
aac     |           |     |     |   
wao---wao           |     |     |   
iae                 |     |     |   
                    |     |     |   
oae---oae---oae---oae     |     |   
mbq     |     |           |     |   
max---max     |           |     |   
cbg           |           |     |   
              |           |     |   
sci---sci---sci           |     |   
scc     |                 |     |   
aar---aar                 |     |   
tad                       |     |   

...

The three-letter code mappings to team names are in ./data/YahooTeamCodeMapping.csv.


LEARNING:
There is also starter code for learning, but this is still in progress.

The simplest thing to try is to run

> python learn_synthetic.py

This will run learning with the simplest model, on synthetic data.  The first run
will take a bit longer at startup, because theano is doing the symbolic differentiation.


You can then move on to

> python learn_real.py

This will learn, but on the real data now.  This code is not finished, but it should be
enough structure to get you started.

In model.py, you can see three different models, of increasing level of complexity.  You
can select between these in the learn_*.py scripts.


Some TODOs for the ambitious:

- Load the full data and verify it against the aggregate data.

- Set up a proper validation/testing framework, so we can evaluate different methods
  properly.  Perhaps we want to do leave-one-out cross validation.

- Try different objective functions in the theano models -- what do we actually
  want to optimize?

- Think about how to better include pace of the game

- Improve the optimization (maybe using momentum, LBFGS, or conjugate gradients?)