Gal Chechik's MATLAB data (1987--2003) is INCORRECT. The data in 1987--1999 is from Sam Roweis, while the 2000-2003 data was processed by Gal. I use a single script (scripts/convert_docs.m) to extract text from the single .mat file that contains all documents from 1987--2003. Unfortunately something is wrong with the data from 2000--2003. If I take a file from 1987 (e.g., 0001) and do cat filename | uniq -c | sort -n then the word counts look fine. If I do the same thing for documents in 2000-2003, the word counts look wrong. For example, 2001/AA01 (the infinite HMM paper) has the following as its most probable words: 16 department 16 emotion 25 universality 28 introduction 28 unit 38 bayesian 44 codes 48 basic 62 ruyter This is clearly wrong. Fortunately, Xuerui Wang had Gal's raw text data, so I'm now using that instead (data/raw). I'm downloading 2004--2006 myself.