LogicLDA - Topic modeling with First-Order Logic (FOL) domain knowledge

David Andrzejewski (andrzeje@cs.wisc.edu)
Department of Computer Sciences
University of Wisconsin-Madison, USA

OVERVIEW

This code implements inference for the LogicLDA model [1].  LogicLDA
extends Latent Dirichlet Allocation (LDA) [2] by allowing the user to
specify a weighted first-order logic (FOL) knowledge base (KB), as in
Markov Logic Networks (MLN) [3].  The inferred topics are then
influenced by both document-word corpus statistics and the
user-specified logical KB.  This code is implemented in Java and
performs scalable MAP inference via a stochastic gradient descent
scheme.

BUILD

The LogicLDA Java project can be built with Maven:

$ mvn package
$ cp ./target/logiclda-0.0.1-SNAPSHOT-jar-with-dependencies.jar ./logiclda.jar


USAGE

Say that our dataset is named 'nyt' (see INPUT/OUTPUT FILES below).
Then we run LogicLDA inference with:

$ java -jar logiclda.jar nyt 500 100 10000 25 194582

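The arguments, in order, tell LogicLDA to:

        nyt             read and write files named after the dataset 'nyt'
        500             run 500 iterations of collapsed Gibbs sampling to initialize
        100             run 100 outer iterations of Mir SGD inference
        10000           run 10000 inner iterations of Mir SGD inference
        25              print the top 25 words for each topic to nyt.topics
        194582          use 194582 as the random number seed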
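
To make the argument order concrete, here is a minimal sketch that
gives each of the six positional arguments a name.  The class RunArgs
and its field names are hypothetical and used only for illustration;
this is not LogicLDA's actual argument-handling code, and the numeric
types (for example a long seed) are assumptions.

```java
// Hypothetical helper that names the six positional arguments from the
// usage example.  NOT part of LogicLDA; field names and types are assumed.
public class RunArgs {
    final String dataset;   // basename shared by all input/output files
    final int gibbsIters;   // collapsed Gibbs iterations used to initialize
    final int outerIters;   // outer iterations of SGD inference
    final int innerIters;   // inner iterations of SGD inference
    final int topWords;     // top words per topic written to <dataset>.topics
    final long seed;        // random number seed

    RunArgs(String[] a) {
        if (a.length != 6) {
            throw new IllegalArgumentException(
                "expected 6 arguments, got " + a.length);
        }
        dataset = a[0];
        gibbsIters = Integer.parseInt(a[1]);
        outerIters = Integer.parseInt(a[2]);
        innerIters = Integer.parseInt(a[3]);
        topWords = Integer.parseInt(a[4]);
        seed = Long.parseLong(a[5]);
    }

    @Override
    public String toString() {
        return "dataset=" + dataset + " gibbs=" + gibbsIters
            + " outer=" + outerIters + " inner=" + innerIters
            + " topWords=" + topWords + " seed=" + seed;
    }

    public static void main(String[] args) {
        // Same values as the usage example.
        String[] example = {"nyt", "500", "100", "10000", "25", "194582"};
        System.out.println(new RunArgs(example));
    }
}
```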

An example dataset and bash script can be found in ./test


INPUT/OUTPUT FILES

The input and output files obey a naming convention in which each
filename consists of the dataset name plus a filetype-specific
extension.  For example, if we were processing a corpus of New York
Times articles, we would supply input files nyt.docs, nyt.vocab, ...
and LogicLDA would give us output files nyt.topics, nyt.sample, ...

Note that *.words should contain the integer indices of word tokens,
not the tokens themselves.  Each index is the zero-based line number
of the word in *.vocab.  For example, if *.vocab is

foo
bar
baz

Then the string "foo foo baz bar foo" should be in *.words as: 0 0 2 1 0

These kinds of representations can be built from plaintext with the
textproc code (https://github.com/davidandrzej/textproc).
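
The index encoding described above can be sketched in a few lines of
Java.  This is only an illustration of the mapping from tokens to
zero-based vocabulary line numbers; the class VocabEncoder and its
methods are hypothetical, it splits tokens on whitespace, and it
rejects out-of-vocabulary tokens, all of which are assumptions rather
than the behavior of textproc or LogicLDA.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the *.vocab / *.words encoding; not LogicLDA code.
public class VocabEncoder {
    // Maps each vocabulary word to its zero-based line number in *.vocab.
    static Map<String, Integer> indexVocab(List<String> vocabLines) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < vocabLines.size(); i++) {
            index.put(vocabLines.get(i), i);
        }
        return index;
    }

    // Converts whitespace-separated tokens into vocabulary indices.
    // Assumption: a token missing from the vocabulary is an error.
    static List<Integer> encode(String text, Map<String, Integer> index) {
        List<Integer> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        for (String tok : trimmed.split("\\s+")) {
            Integer id = index.get(tok);
            if (id == null) {
                throw new IllegalArgumentException(
                    "token not in vocabulary: " + tok);
            }
            out.add(id);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> index =
            indexVocab(Arrays.asList("foo", "bar", "baz"));
        // Prints [0, 0, 2, 1, 0], matching the example above.
        System.out.println(encode("foo foo baz bar foo", index));
    }
}
```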

[input]
        .rules          FOL rules (see RULES for details)
        .alpha          [T] alpha hyperparameter
        .beta           [TxW] beta hyperparameter
        .doclist        [D] document names
        .vocab          [W] vocabulary (one word per line)
        .words          [N] word indices for each corpus position  
        .docs           [N] document indices for each corpus position

        (optional)
        .sent           [N] sentence indices for each corpus position 
        .init           [N] initial z-sample state

[output]
        .sample         [N] latent topic indices for each corpus position
        .phi            [TxW] topic-word probabilities P(w|z)
        .theta          [DxT] document-topic probabilities P(z|d)
        .topics         plaintext summary of learned topics
        .logic          plaintext summary of logic rule satisfaction
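
To illustrate how the length-N arrays relate to the DxT matrix, the
sketch below computes the standard smoothed LDA estimate of P(z|d)
from a topic assignment per corpus position (as in .sample), a
document index per corpus position (as in .docs), and a length-T alpha
vector.  Whether LogicLDA computes .theta in exactly this way is not
stated here, so treat it as an assumption.  The class ThetaFromSample
is hypothetical, indices are assumed to be zero-based, and file
parsing is omitted because the on-disk formats are not specified in
this README.

```java
import java.util.Arrays;

// Hypothetical sketch: smoothed document-topic estimate from a z-sample.
// theta[d][t] = (n_dt + alpha_t) / (n_d + sum(alpha)); not LogicLDA code.
public class ThetaFromSample {
    static double[][] estimateTheta(int[] z, int[] docs,
                                    double[] alpha, int numDocs) {
        if (z.length != docs.length) {
            throw new IllegalArgumentException(
                "z and docs must both have length N");
        }
        int numTopics = alpha.length;
        double alphaSum = 0.0;
        for (double a : alpha) {
            alphaSum += a;
        }
        double[][] theta = new double[numDocs][numTopics];
        int[] docLen = new int[numDocs];
        // Count how often each topic is assigned within each document.
        for (int i = 0; i < z.length; i++) {
            theta[docs[i]][z[i]] += 1.0;
            docLen[docs[i]]++;
        }
        // Add the alpha pseudo-counts and normalize each row to sum to 1.
        for (int d = 0; d < numDocs; d++) {
            for (int t = 0; t < numTopics; t++) {
                theta[d][t] = (theta[d][t] + alpha[t])
                    / (docLen[d] + alphaSum);
            }
        }
        return theta;
    }

    public static void main(String[] args) {
        int[] z = {0, 0, 1, 1, 1};      // toy topic assignments (N = 5)
        int[] docs = {0, 0, 0, 1, 1};   // toy document indices (D = 2)
        double[] alpha = {0.5, 0.5};    // toy hyperparameter (T = 2)
        for (double[] row : estimateTheta(z, docs, alpha, 2)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```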


EXTENDING LOGICLDA

This software is designed to allow the straightforward inclusion of
custom rule types and sources of side information (see EXTENDING for
details).


LICENSE

This software is open-source, released under the terms of the GNU
General Public License version 3, or any later version of the GPL (see
COPYING).


REFERENCES

[1] Andrzejewski, D., Zhu, X., Craven, M., and Recht, B. (2011).  A
Framework for Incorporating General Domain Knowledge into Latent
Dirichlet Allocation Using First-Order Logic. In Proceedings of the
22nd International Joint Conference on Artificial Intelligence (IJCAI
2011).

[2] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet
Allocation.  Journal of Machine Learning Research (JMLR) 3
(Mar. 2003), 993-1022.

[3] Domingos, P. and Lowd, D. (2009).  Markov Logic: An Interface
Layer for Artificial Intelligence. Synthesis Lectures on Artificial
Intelligence and Machine Learning, Morgan and Claypool Publishers.