----------------------------------------------------------------------
----------------------------- README ---------------------------------
----------------------------------------------------------------------

Code and data for Krafft et al., "Topic-Partitioned Multinetwork
Embeddings", NIPS 2012.

If you use this code or data in a paper, please use the following
citation:

@incollection{NIPS2012_1288,
  title = {Topic-Partitioned Multinetwork Embeddings},
  author = {Peter Krafft and Juston Moore and Bruce Desmarais and Hanna Wallach},
  booktitle = {Advances in Neural Information Processing Systems 25},
  editor = {P. Bartlett and F.C.N. Pereira and C.J.C. Burges and L. Bottou and K.Q. Weinberger},
  pages = {2816--2824},
  year = {2012},
  url = {http://books.nips.cc/papers/files/nips25/NIPS2012_1288.pdf}
}

The programs and documents are distributed without any warranty,
express or implied. As the programs were written for research purposes
only, they have not been tested to the degree that would be advisable
in any important application. All use of these programs is entirely at
the user's own risk.

------------------------------------
----------- QUICKSTART -------------
------------------------------------

This quickstart section provides an example of how to train our model
using this package. All commands should be run from this top
directory.

--- data format ---

Our main model requires three data files: one that contains the words
of each email, one that contains the recipients of each email, and one
that contains the vocabulary used in the emails. The format of these
files is assumed to be as follows:

word matrix
- each line represents a document
- columns are separated by commas
- the first column gives the name of the original document location
  (this column can also be empty)
- each subsequent column should contain a nonnegative number
  indicating the number of times the word type associated with that
  column occurs in that document (i.e. a vector of word counts
  corresponding to the word types given in the vocab file)

edge matrix
- each line represents a document
- columns are separated by commas
- the first column gives the name of the original document location
  (this column can also be empty)
- the second column gives an index between zero and the number of
  actors in the email network minus one (inclusive) indicating the
  author of that email
- there is one additional column for each actor in the email network.
  Each column should contain either a one (indicating that the actor
  is a recipient of that row's email) or a zero (indicating that the
  actor is not a recipient of that row's email). The order of these
  columns should correspond to the indices used to indicate the
  authors of the emails. The column for the email's author should
  be 0.

vocab file
- each line represents a word type in the vocabulary
- the order of the words must correspond to the order of the columns
  in the word matrix file

--- data format example ---

file : ./data/example-raw/doc-1.txt

To: blue@example.com
From: blah@example.com
Subject: the apple

Apple crisp!

file : ./data/example-raw/doc-2.txt

To: blue@example.com
From: blech@example.com, blah@example.com
Subject: the apple

Tree! Tree, tree? Tree. (Tree)

file : ./data/example-raw/doc-3.txt

To: blue@example.com
From: blech@example.com
Subject: potato pie

file : ./data/example/word-matrix.csv

./data/raw/doc-1.txt,1,2,1,0,0,0
./data/raw/doc-2.txt,1,2,0,0,5,0
./data/raw/doc-3.txt,0,0,0,1,0,1

file : ./data/example/edge-matrix.csv

./data/raw/doc-1.txt,2,0,1,0
./data/raw/doc-2.txt,2,1,1,0
./data/raw/doc-3.txt,0,0,0,1

file : ./data/example/vocab.txt

the
apple
crisp
tree
potato
pie
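As a sanity check on these formats, the following is a minimal,
hypothetical Java sketch. It is not part of this package; the class
name CheckDataFormat and the hard-coded paths (taken from the example
above) are illustrative. It reads a vocab file and a word matrix and
prints the nonzero word counts for each document:

file : CheckDataFormat.java (hypothetical)

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CheckDataFormat {
    public static void main(String[] args) throws IOException {
        // Read the vocabulary: one word type per line.
        List<String> vocab = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new FileReader("./data/example/vocab.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    vocab.add(line.trim());
                }
            }
        }
        // Read the word matrix: document name, then one count column
        // per word type, in vocab order.
        try (BufferedReader in = new BufferedReader(
                new FileReader("./data/example/word-matrix.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",", -1);
                if (cols.length != vocab.size() + 1) {
                    System.err.println("Bad row (expected "
                            + (vocab.size() + 1) + " columns): " + line);
                    continue;
                }
                System.out.print(cols[0] + " :");
                for (int w = 0; w < vocab.size(); w++) {
                    int count = Integer.parseInt(cols[w + 1].trim());
                    if (count > 0) {
                        System.out.print(" " + vocab.get(w) + "=" + count);
                    }
                }
                System.out.println();
            }
        }
    }
}

# compile and run the hypothetical sketch
$ javac CheckDataFormat.java
$ java CheckDataFormat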
--- training our main model ---

# build the jar file (requires ant version 1.7 or higher)
$ ant

# make the output directory
$ mkdir ./output

# train the model (this might take a little while)
# note that running the model for more iterations generally gives
# better results
$ java -Xmx1G -cp build/jar/NetworkModels.jar experiments.ConditionalStructureExperiment --word-matrix=./data/nhc/word-matrix.csv --edge-matrix=./data/nhc/edge-matrix.csv --vocab=./data/nhc/vocab.txt --num-actors=30 --num-topics=50 --num-latent-dims=2 --alpha=0.01 --beta=0.01 --num-iter=1000 --print-interval=1 --save-state-interval=100 --verbose --out-folder=./output

--- understanding the output ---

doc_topics.txt.gz : document-specific topic proportions

edge_state.txt.gz : assignments of recipients to tokens

intercepts.txt : intercept for each topic-specific latent space

latent_spaces.txt : positions for each topic-specific latent space.
Each row is an entire space. The position of actor i is at the
indices i*k, i*k + 1, ..., (i+1)*k - 1, where k is the dimension of
the space.

log_like.txt : the log likelihood at particular iterations (frequency
of calculation depends on the value of the print-interval option)

log_prob.txt : the joint probability of the data and the model
parameters at particular iterations (frequency determined by the
print-interval option)

options.0.txt : some information about the job that was run

topic_summary.txt.gz : the coherence and top ten words in each topic

topic_words.txt.gz : topic-specific distributions over word types

word_state.txt.gz : assignments of tokens to topics

Files such as word_state.txt.gz.5 are created when
--save-state-interval is greater than 0 and represent the same data
as the stem file but at the iteration specified at the end of the
file name.
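To make the latent_spaces.txt layout concrete, here is a minimal,
hypothetical Java sketch (not part of this package; the whitespace
delimiter and the numeric values are assumptions) that slices one row
of a latent space into per-actor positions using the indexing
described above:

file : SliceLatentSpace.java (hypothetical)

public class SliceLatentSpace {
    public static void main(String[] args) {
        int k = 2;          // latent dimensions (--num-latent-dims)
        int numActors = 3;  // actors in the network (--num-actors)
        // One row of latent_spaces.txt holds an entire k-dimensional
        // space; these values are made up for illustration.
        String row = "0.1 0.2 -0.3 0.4 0.5 -0.6";
        String[] vals = row.trim().split("\\s+");
        for (int i = 0; i < numActors; i++) {
            // Actor i's position occupies indices i*k, ..., (i+1)*k - 1.
            StringBuilder pos = new StringBuilder("actor " + i + " :");
            for (int d = i * k; d < (i + 1) * k; d++) {
                pos.append(' ').append(vals[d]);
            }
            System.out.println(pos);
        }
    }
}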
--- restarting a job from a specific iteration ---

# run the commands above first
$ java -Xmx1G -cp build/jar/NetworkModels.jar experiments.ConditionalStructureExperiment --word-matrix=./data/nhc/word-matrix.csv --edge-matrix=./data/nhc/edge-matrix.csv --vocab=./data/nhc/vocab.txt --num-actors=30 --num-topics=50 --num-latent-dims=2 --alpha=0.01 --beta=0.01 --num-iter=10 --print-interval=1 --save-state-interval=5 --verbose --out-folder=./output --read-from-folder --read-folder=./output --iter-offset=1000

--- more help ---

# see all of the command line options
$ java -cp build/jar/NetworkModels.jar experiments.ConditionalStructureExperiment --help

------------------------------------
------------- OVERVIEW -------------
------------------------------------

This repo contains implementations of MCMC samplers for several
models. Each model is associated with a class in the experiments
package that can be used to train that model.

To run a particular model use:

$ java -cp build/jar/NetworkModels.jar experiments.[ class name ]

For help on the command-line arguments use:

$ java -cp build/jar/NetworkModels.jar experiments.[ class name ] --help

--- Model Classes ---

Topic-Partitioned Multinetwork Embedding (Krafft et al., 2012)
 * ConditionalStructureExperiment

Bernoulli TPME (Krafft et al., 2012)
 * ConditionalStructureBernoulliExperiment

Bernoulli Link-LDA (Erosheva et al., 2004)
 * BernoulliEroshevaExperiment

Mixed-Membership Latent Space Model
 * MMLSEMExperiment

Mixed-Membership Stochastic Blockmodel (Airoldi et al., 2008)
 * Separately implemented in ./mmsb

Latent Dirichlet Allocation (Blei et al., 2003)
 * LDAExperiment

A simple baseline is also implemented:
 * EdgeFrequencyExperiment

Special Cases:
 - LSM (Hoff et al., 2002) is a special case of MMLSEMExperiment when
   num-topics is one.

------------------------------------
--------------- DATA ---------------
------------------------------------

We collected the NHC data ourselves. It is part of the public record.

We downloaded the Enron data from
http://www.infochimps.com/datasets/enron-email-data-with-manager-subordinate-relationship-metadata#overview_tab
It is also part of the public record.

------------------------------------
----------- DEPENDENCIES -----------
------------------------------------

MALLET and its dependencies
Apache Commons CLI version 1.2
GNU Trove

build.xml requires ant version 1.7 or later.

To run MMSB you will need a soft link to the Boost C++ library in the
mmsb directory. You can currently download this library here:
http://www.boost.org/
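Since each experiment class is invoked through its main method, a
model can also be launched programmatically from other Java code. A
minimal sketch, assuming build/jar/NetworkModels.jar is on the
classpath; the wrapper class TrainTPME is hypothetical and the
argument values simply mirror the quickstart command:

file : TrainTPME.java (hypothetical)

import experiments.ConditionalStructureExperiment;

public class TrainTPME {
    public static void main(String[] args) throws Exception {
        // Arguments exactly as they would appear on the command line.
        String[] modelArgs = {
            "--word-matrix=./data/nhc/word-matrix.csv",
            "--edge-matrix=./data/nhc/edge-matrix.csv",
            "--vocab=./data/nhc/vocab.txt",
            "--num-actors=30",
            "--num-topics=50",
            "--num-latent-dims=2",
            "--alpha=0.01",
            "--beta=0.01",
            "--num-iter=1000",
            "--print-interval=1",
            "--save-state-interval=100",
            "--verbose",
            "--out-folder=./output"
        };
        // Forward the arguments to the experiment's entry point.
        ConditionalStructureExperiment.main(modelArgs);
    }
}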