
Code for generating a data set of NIPS papers from 1987--2006.

Gal Chechik's MATLAB data (1987--2003) is INCORRECT. The 1987--1999
data is from Sam Roweis, while the 2000--2003 data was processed by
Gal. I use a single script (scripts/convert_docs.m) to extract text
from the single .mat file that contains all documents from
1987--2003. Unfortunately, something is wrong with the data from
2000--2003. If I take a file from 1987 (e.g., 0001) and run

sort filename | uniq -c | sort -n

then the word counts look fine. If I do the same thing for documents
from 2000--2003, the word counts look wrong. For example, 2001/AA01
(the infinite HMM paper) has the following as its most frequent words:

     16 department
     16 emotion
     25 universality
     28 introduction
     28 unit
     38 bayesian
     44 codes
     48 basic
     62 ruyter

This is clearly wrong: words such as "ruyter" and "emotion" should not
dominate a paper on hidden Markov models. Fortunately, Xuerui Wang had
Gal's raw text data, so I'm now using that instead (data/raw).
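
The same sanity check can be sketched in Python. This is a minimal
sketch, not part of the repository: it assumes each extracted document
holds one word per line (the format is not documented here), and the
names top_words and top_words_in_file are hypothetical. Unlike a bare
uniq -c, it counts every occurrence of a word whether or not repeated
words sit on adjacent lines.

```python
from collections import Counter


def top_words(lines, n=10):
    """Return the n most frequent words as (word, count) pairs.

    Assumes one word per line; blank lines are ignored.
    """
    words = (line.strip() for line in lines)
    counts = Counter(word for word in words if word)
    return counts.most_common(n)


def top_words_in_file(path, n=10):
    """Apply top_words to a file (hypothetical helper for illustration)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return top_words(handle, n)
```

Calling top_words_in_file on a document from 1987 and on one from
2000--2003 should make the difference visible: the top entries for a
healthy document should be words that plausibly belong to that paper.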

I'm downloading the 2004--2006 papers myself.