YelpRestaurantReview

Uses the f-LDA method to retrieve trends for restaurants of different cultures.


Factorial LDA Code
------------------------------------

Copyright (c) 2013 Michael Paul 
Johns Hopkins University
mpaul@cs.jhu.edu


Please cite the following paper in any work that uses this material:

@InProceedings{paul-dredze-flda-nips-2012,
  author    = {Paul, Michael J. and Dredze, Mark},
  title     = {Factorial LDA: Sparse Multi-Dimensional Text Models},
  booktitle = {Advances in Neural Information Processing Systems (NIPS 2012)},
  month     = {December},
  year      = {2012},
  url       = {http://books.nips.cc/papers/files/nips25/NIPS2012_1224.pdf}
}

The Factorial LDA Code is free software; you can 
redistribute it and/or modify it under the terms of the GNU General Public 
License as published by the Free Software Foundation; either version 2 of the 
License, or (at your option) any later version.

This software is distributed in the hope that it will 
be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of 
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public 
License for more details.

You should have received a copy of the GNU General Public License along 
with this software; if not, write to the Free Software Foundation, Inc., 
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

================================================================================

1. Introduction
---------------

This software includes an implementation of Factorial LDA, described in the paper above. 
Please refer to the paper for an overview of the model. The command-line 
parameters for this software follow the notation from the paper.


2. Revision History
-------------------

v0.1:  August 18, 2013 - Initial release.


3. Installation
---------------

Straightforward Java compilation can be done with the following commands:

> tar -xzvf flda-0.1.tar.gz
> cd flda
> javac *.java


4. Usage
--------

To run the program, enter the command:

> java LearnTopicModel -model flda -input <input_file> -K <int> -Z <int> -Y <int> [-iters <int>] [<model-specific parameters>]

<input_file> is the filename of the input (format described in Section 5).

The required parameter -K specifies the number of factors.
The required parameter -Z specifies the number of components of the first factor.
The required parameter -Y specifies the number of components of all other factors.

This implementation currently assumes all factors after the first have the same number
of components. If you need to specify different numbers, you'll need to modify 
LearnTopicModel.java.

The optional parameter -iters specifies the total number of Gibbs sampling 
iterations to perform. If unspecified, this defaults to 2000.

The optional parameter -samples specifies the number of samples to collect
and store in the output file. If unspecified, this defaults to 100. The
samples are collected at the end of the Gibbs sampling run. For example, if
you set -iters to 5000 and -samples to 200, the sampler will run for a "burn in"
of 4800 iterations, and it will save the samples from the final 200.
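
For instance, that scenario corresponds to a command such as:

> java LearnTopicModel -model flda -input data/input.txt -K 3 -Z 20 -Y 2 -iters 5000 -samples 200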

The -model parameter MUST be "flda". This code currently contains only the
fLDA model, but I eventually plan to integrate it with code for my other
models.

Additional command-line parameters are described in 4.1 below.

When the program finishes, it writes the final variable assignments to the file
<input_file>.assign -- the output format is similar to the input format, except 
each word token has been appended with a :-separated list of the number of 
times each tuple value was sampled. 

The parameters theta, phi, etc. can be computed from this output
file. For convenience, a Python script is included to print out the top
words (omega and phi) for the tuples; its usage is described in Section 7.


4.1. Parameters 
---------------

The optional command-line parameters for this model are:

[-sigmaA <double>]        The stddev for the alpha parameters. Default 1.0. 
[-sigmaAB <double>]       The stddev for the alpha^(B) parameter. Default 1.0. 
[-sigmaW <double>]        The stddev for the omega parameters. Default 0.5. 
[-sigmaWB <double>]       The stddev for the omega^(B) parameter. Default 10.0.
[-delta0 <double>]        The first parameter for the Beta sparsity prior. Default 0.1. 
[-delta1 <double>]        The second parameter for the Beta sparsity prior. Default 0.1. 
[-alphaB <double>]        The initial value of alpha^(B). Default -5.0. 
[-omegaB <double>]        The initial value of omega^(B). Default -5.0. 
[-stepSizeADZ <double>]   The gradient step size for the document-specific alpha^(d) parameters.
                          Default 1e-2. 
[-stepSizeAZ <double>]    The gradient step size for the corpus-wide alpha^(D) parameters. 
                          Default [stepSizeADZ]/100.0. 
[-stepSizeAB <double>]    The gradient step size for the alpha^(B) parameter. 
                          Default [stepSizeADZ]/100.0. 
[-stepSizeW <double>]     The gradient step size for the omega parameters. 
                          Default 1e-3. 
[-stepSizeWB <double>]    The gradient step size for the omega^(B) parameter. 
                          Default [stepSizeADZ]/100.0. 
[-stepSizeB <double>]     The gradient step size for the sparsity parameters. 
                          Default 1e-3. 
[-likelihoodFreq <int>]   The interval at which the corpus log-likelihood is computed and displayed.
                          It can take the following values:
                            -1: Never compute the log-likelihood (fastest)
                             0: Compute the log-likelihood after every sampling iteration (same as 1)
                             x: Compute the log-likelihood every x iterations; x > 0
                          Default 100.
[-blockFreq <int>]        EXPERIMENTAL. The interval at which a token is sampled as a block (all tuples)
                          as opposed to sampling each factor's value independently. This is
                          explained in 4.2. It can take the following values:
                            -1: Never sample as block (fastest, but worse mixing)
                             0: Always sample as block (same as 1)
                             x: Sample as block every x iterations; x > 0
                          Default 0 (always use block sampling).
[-priorPrefix <string>]   The prefix for filenames containing Gaussian prior means for omega.
                          This is explained in 4.3. If empty, all priors are assumed to be 0-mean.
                          Default "".

Example usage:

> java LearnTopicModel -model flda -input data/input.txt -K 3 -Z 20 -Y 2 -iters 5000 -sigmaW 0.1

4.2 The blockFreq parameter
---------------------------

A source of slowdown in f-LDA is that the sampler considers all possible tuple values
for each token. The number of possible tuples grows exponentially with the number
of factors. For example, if Z=<20,2,2> then there are 20*2*2=80 values
that the sampler enumerates.

Rather than sampling over all possible tuples, this implementation also supports the
ability to sample each factor independently, conditioned on the values of the other
factors. In that case, if Z=<20,2,2>, the sampler only needs to consider
20+2+2=24 possibilities, because it only considers the values within each factor,
with the other factors fixed to their previous values.

This is much faster because the cost is additive rather than multiplicative.
However, the sampler may not mix as well because the factors are not sampled jointly.
(We'll call the full sampler a "block" sampler.) The implementation can alternate
between the two approaches, performing the more expensive block sampler every
-blockFreq iterations. This can provide a large speedup, but I have not experimented
much with this method, so you will need to explore the speed/accuracy tradeoff
if you are interested.
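
As a minimal illustration of this tradeoff (a hypothetical standalone example,
not code from the package), the following sketch counts the candidate
assignments each strategy enumerates per token for Z=<20,2,2>:

public class CandidateCount {
    public static void main(String[] args) {
        int[] dims = {20, 2, 2};  // Z=<20,2,2>: 20 components in factor 0, 2 in each other factor

        // Block (joint) sampling enumerates every tuple: multiplicative cost.
        int joint = 1;
        for (int d : dims) joint *= d;

        // Factored sampling resamples one factor at a time, holding the others
        // fixed at their previous values: additive cost.
        int factored = 0;
        for (int d : dims) factored += d;

        System.out.println("block candidates per token:    " + joint);    // 80
        System.out.println("factored candidates per token: " + factored); // 24
    }
}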

4.3 The priorPrefix parameter
-----------------------------

By default, the Gaussian priors over the omega variables have means of 0. You can 
specify files defining other mean values using the -priorPrefix parameter.

If you use this feature, you must create a set of files specifying values for the
priors for every tuple as well as the background weights, omega^0. The -priorPrefix
parameter specifies the beginning of the file paths for these files. As an example,
if you set -priorPrefix to "priors/weights" and you use Z=<3,2> (-K 2 -Z 3 -Y 2)
then the following files must exist:

priors/weights.txt
priors/weights0_0.txt
priors/weights0_1.txt
priors/weights0_2.txt
priors/weights1_0.txt
priors/weights1_1.txt

The priors/weights.txt file would contain values for the background distribution,
while each priors/weights{k}_{i}.txt would contain values for component i of
factor k.

The format of each file is one word per line, where each line is: <word> <mean value>

For example:

red 1.0
blue 2.5
green -0.5

Words that are not contained in the file will take the default value of 0.
A file can be completely empty if you do not want to specify prior values for
a particular tuple, although the file must still exist.
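
As a minimal sketch of how such prior files could be loaded (hypothetical
example code with made-up class and method names, not the loader used by the
f-LDA implementation itself):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class PriorReader {
    // Reads one prior file: one "<word> <mean value>" pair per line.
    // Words absent from the file keep the default mean of 0.
    static Map<String, Double> readPriorFile(String path) throws IOException {
        Map<String, Double> means = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;  // a prior file may be empty
                String[] parts = line.split("\\s+");
                means.put(parts[0], Double.parseDouble(parts[1]));
            }
        }
        return means;
    }

    public static void main(String[] args) throws IOException {
        String prefix = "priors/weights";  // the value passed to -priorPrefix
        int[] components = {3, 2};         // Z=<3,2>, i.e. -K 2 -Z 3 -Y 2

        Map<String, Double> background = readPriorFile(prefix + ".txt");
        System.out.println("background: " + background.size() + " entries");
        for (int k = 0; k < components.length; k++)
            for (int i = 0; i < components[k]; i++) {
                // priors/weights{k}_{i}.txt holds means for component i of factor k
                Map<String, Double> m = readPriorFile(prefix + k + "_" + i + ".txt");
                System.out.println(prefix + k + "_" + i + ": " + m.size() + " entries");
            }
    }
}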

If you use this functionality, please also cite this paper:

@InProceedings{paul-dredze-drugs-naacl-2013,
  author    = {Paul, Michael J. and Dredze, Mark},
  title     = {Drug Extraction from the Web: Summarizing Drug Experiences with Multi-Dimensional Topic Models},
  booktitle = {North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013)},
  month     = {June},
  year      = {2013},
  url       = {http://www.aclweb.org/anthology/N/N13/N13-1017.pdf}
}



5. Input Format
---------------

The format of the input file is:

<doc_id> <doc_words (space-delimited)>

Example: 

0 this is a document 
1 this is another document 

The first column is an integer ID that is not used by the program, so you can
set this to whatever you want. The IDs do not have to be unique, but they must
be integers. These IDs will be saved in the output file, so you can use them to
identify the documents later.
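
For instance, an input file in this format could be produced with a sketch like
the following (hypothetical example code, not part of the package):

import java.io.IOException;
import java.io.PrintWriter;

public class WriteInput {
    public static void main(String[] args) throws IOException {
        String[] docs = {"this is a document", "this is another document"};
        try (PrintWriter out = new PrintWriter("input.txt")) {
            for (int id = 0; id < docs.length; id++) {
                // <doc_id> <doc_words (space-delimited)>
                out.println(id + " " + docs[id]);
            }
        }
    }
}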


6. Output Format
----------------

The output format is the same as the input format, except each word token
is appended with a colon-separated list of sample counts for each tuple.
For example, if you modeled 6 tuples and collected 100 samples, "word" in
the input file may be written as "word:0:50:0:0:30:20" in the output file.
This means that this token was assigned to tuple #2 in 50 of the 
100 samples, while tuples #5 and #6 were sampled 30 and 20 times, respectively.
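
As an illustration (a hypothetical sketch, not the included topwords script),
one such annotated token can be parsed back into a word and its per-tuple
sample counts:

public class ParseAssign {
    public static void main(String[] args) {
        String token = "word:0:50:0:0:30:20";  // 6 tuples, 100 collected samples

        String[] parts = token.split(":");
        String word = parts[0];
        int[] counts = new int[parts.length - 1];
        int best = 0;
        for (int t = 0; t < counts.length; t++) {
            counts[t] = Integer.parseInt(parts[t + 1]);
            if (counts[t] > counts[best]) best = t;
        }
        // Indices here are 0-based, so "tuple #2" in the text is index 1.
        System.out.println(word + " was most often assigned to tuple index " + best);
    }
}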

The output is written to a file of the same name as the input file, except
the filename is appended with ".assign".

The learned parameters and hyperparameters are also written to files with extensions
".alpha", ".omega", ".beta", etc. 


7. Viewing the Top Words
------------------------

A Python script, topwords_flda.py, is included to print out the top words for 
the topics. It takes five command-line arguments, described below.

Example usage:

> python topwords_flda.py 3 20 2 data/input.txt 100 > output_topwords.txt

The first parameter is -K, the second is -Z, the third is -Y, the fourth
is the filename of the sampler input file (NOT the output file to which
".assign" has been appended), and the fifth is the number of samples
that were collected (specified by the -samples parameter, default 100).

This shows the highest-weight words for each omega vector as well as the
most-sampled words for each tuple. The omega weights for the background and for
each factor > 0 are shown first. Then each "topic" (component of factor 0) is
shown; for each topic, the omega weights are shown first, followed by the
sampler counts for each tuple that includes that topic.