
Supervised Latent Dirichlet Allocation and other topic models. Supports regression and classification. Written in Matlab.


SuperTopicModels : toolbox for using LDA and sLDA in Matlab for regression/classification
Website: http://michaelchughes.github.com/
Author:  Mike Hughes (www.michaelchughes.com)
Please email all comments/questions to mike <AT> michaelchughes.com

This toolbox provides code for running Markov chain Monte Carlo (MCMC) posterior
inference for a variety of topic models, including latent Dirichlet allocation (LDA) and *supervised* latent Dirichlet allocation (sLDA).    

QUICK START

The repository is organized as follows:  
  code/ contains relevant Matlab code. This should be the working dir in Matlab.
    code/demo/  contains heavily-commented intro scripts for how to do:
       (1) basic unsupervised LDA training (EasyDemo)
       (2) LDA + regression prediction (DemoRegression_LDA)
          sLDA + regression prediction (DemoRegression_sLDA)
       (3) LDA + binary classification (DemoBinaryClassifier_LDA)
          sLDA + binary classification (DemoBinaryClassifier_sLDA)

Most core sampling routines are implemented as fast MEX C++ code.  
  To compile them, run ConfigToolbox.m first, then run any of the demos.

DATA FORMAT
The entire dataset is a single struct array, in which each document "d" has two fields:
  words (as a vector of unique term ids) 
        Data(d).words
  target variable (either real-valued, binary, or one-of-K)
        Data(d).y

To understand the data, inspect the "TrainData" and "TestData" variables left in your workspace after any of the demos.
Alternatively, see the genSynthData* functions, which build toy datasets.

To use your own dataset, create the structs and save them as "MyCustomTrainData.mat" and "MyCustomTestData.mat".
  Then call runLDA or runSuperLDA with {'/full/path/to/MyCustomTrainData.mat'} as the first argument.
The file names are arbitrary; "MyCustom..." is only an example.
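
As a rough illustration of the format described above, the sketch below builds a three-document toy dataset and saves it to a .mat file. The term ids, target values, and output location are made up for the example. The variable name stored inside the file ("Data") is an assumption; check the variables the demos leave in your workspace, or the genSynthData* functions, for the name the toolbox expects.

```matlab
% Toy corpus: 3 documents over a hypothetical 8-term vocabulary.
% All ids and targets are invented for illustration.
Data = struct('words', {}, 'y', {});   % empty struct array with both fields

Data(1).words = [1 3 5];    % term ids appearing in document 1
Data(1).y     = 0.7;        % real-valued target (regression setting)

Data(2).words = [2 3 8];
Data(2).y     = -1.2;

Data(3).words = [4 6 7];
Data(3).y     = 0.1;

% Save under any file name you like; tempdir keeps this sketch runnable anywhere.
% ASSUMPTION: the variable name "Data" inside the file is a guess, not confirmed.
outfile = fullfile(tempdir, 'MyCustomTrainData.mat');
save(outfile, 'Data');
```

The full path in outfile is what would then be wrapped in a cell array and passed as the first argument to runLDA or runSuperLDA. Any further arguments those functions take are not covered here.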

DEPENDENCIES
  The toolbox works stand-alone for both regression and binary classification.
  For multi-class classification, the fast MEX truncated normal sampling routines in code/rndgen/
     require both the Eigen and Boost C++ libraries.
  Create environment variables that point to the install directories, called
     EIGENPATH and BOOSTPATH
  In Matlab, this can be done simply by running
     setenv( 'EIGENPATH', '/path/to/eigen/on/my/system/');
  Alternatively, native Matlab code is included as a fallback, but it is very slow.
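
A minimal sketch of setting both variables from within Matlab and checking that they took effect. The two directory strings are placeholders, not real install locations. Setting them before running ConfigToolbox.m is an assumption about when the build step reads them.

```matlab
% Placeholder install locations; replace with the directories on your system.
eigenDir = '/path/to/eigen/on/my/system/';
boostDir = '/path/to/boost/on/my/system/';

% setenv changes the environment of the current Matlab session only.
setenv('EIGENPATH', eigenDir);
setenv('BOOSTPATH', boostDir);

% Read the values back to confirm both variables are visible.
fprintf('EIGENPATH = %s\n', getenv('EIGENPATH'));
fprintf('BOOSTPATH = %s\n', getenv('BOOSTPATH'));
```

Because setenv does not persist across sessions, these lines would need to be re-run (for example from a startup.m file) each time Matlab is restarted.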

              
Look for additional documentation and occasional updates on github:
   https://github.com/michaelchughes/NPBayesHMM/

This software is released under the Simple Public License 2.0, a permissive, 
copyleft license.  Please see the LICENSE file for details.