A package of my (Weiwei Yang's) various tools (mostly for NLP). Feel free to email me at wwyang@cs.umd.edu with any questions.
- Check Out
- Dependencies
- Use YWW Tools in Command Line
  - LDA (Latent Dirichlet Allocation) in Command Line
  - tLDA in Command Line
  - MTM in Command Line
  - Other Tools in Command Line
- Use YWW Tools Source Code
  - LDA Code Examples
  - tLDA Code Examples
  - MTM Code Examples
  - Other Code Examples
- Citation
- References
git clone git@github.com:ywwbill/YWWTools-v2.git
- Java 8.
- Files in lib/.
- Files in dict/.
java -cp YWWTools-v2.jar:lib/* yang.weiwei.Tools <config-file>
- Windows users
  - Please replace YWWTools-v2.jar:lib/* with YWWTools-v2.jar;lib/*.
  - If you encounter any encoding problems on the command line (especially when processing Chinese), please add -Dfile.encoding=utf8 to your command.
- In <config-file>, specify the tool you want to use:
tool=<tool-name>
- Supported <tool-name> values (case-insensitive) include
  - LDA: Latent Dirichlet allocation. Includes a variety of extensions.
  - TLDA: Tree LDA.
  - MTM: Multilingual Topic Model.
  - WSBM: Weighted stochastic block model. Finds blocks in a network.
  - SCC: Strongly connected components.
  - Stoplist: Removes stop words. Supports English only, but can support other languages given a dictionary.
  - Lemmatizer: Lemmatizes a POS-tagged corpus. Supports English only, but can support other languages given a dictionary.
  - POS-Tagger: Tags words' POS. Supports English only, but can support other languages given trained models.
  - Stemmer: Stems words. Supports English only.
  - Tokenizer: Tokenizes a corpus. Supports English only, but can support other languages given trained models.
  - Corpus-Converter: Converts a word corpus into an indexed corpus (for LDA) and vice versa.
  - Tree Builder: Builds tree priors from word associations.
- You can always set help to true to see help information for
  - the supported tool names, if you don't specify a tool name:
help=true
  - a specific tool, if you specify it (taking LDA as an example):
help=true
tool=lda
tool=lda
model=lda
vocab=<vocab-file>
corpus=<corpus-file>
trained_model=<model-file>
- Implementation of Blei et al. (2003).
- Required arguments
  - <vocab-file>: Vocabulary file. Each line contains a unique word.
  - <corpus-file>: Corpus file in which documents are represented by word indexes and frequencies. Each line contains a document in the following format:
    <doc-len> <word-type-1>:<frequency-1> <word-type-2>:<frequency-2> ... <word-type-n>:<frequency-n>
    <doc-len> is the total number of tokens in this document. <word-type-i> is the index of a word in <vocab-file>, starting from 0. Words with zero frequency can be omitted.
  - <model-file>: Trained model file in JSON format. Read and written by the program.
- Optional arguments
  - model=<model-name>: The topic model you want to use (default: LDA). Supported <model-name> values (case-insensitive) are
    - LDA: Vanilla LDA.
    - RTM: Relational topic model.
    - Lex-WSB-RTM: RTM with WSB-computed block priors and lexical weights.
    - Lex-WSB-Med-RTM: Lex-WSB-RTM with hinge loss.
    - SLDA: Supervised LDA. Supports multi-class classification.
    - BS-LDA: Binary SLDA.
    - Lex-WSB-BS-LDA: BS-LDA with WSB-computed block priors and lexical weights.
    - Lex-WSB-Med-LDA: Lex-WSB-BS-LDA with hinge loss.
    - BP-LDA: LDA with block priors. Blocks are pre-computed.
    - ST-LDA: Single topic LDA. Each document can only be assigned to one topic.
    - WSB-TM: LDA with block priors. Blocks are computed by WSBM.
  - test=true: Use the model for test (default: false).
  - verbose=true: Print log to console (default: true).
  - alpha=<alpha-value>: Parameter of Dirichlet prior of document distribution over topics (default: 1.0). Must be a positive real number.
  - beta=<beta-value>: Parameter of Dirichlet prior of topic distribution over words (default: 0.1). Must be a positive real number.
  - topics=<num-topics>: Number of topics (default: 10). Must be a positive integer.
  - iters=<num-iters>: Number of iterations (default: 100). Must be a positive integer.
  - update=false: Update alpha while sampling (default: false).
  - update_interval=<update-interval>: Interval of updating alpha (default: 10). Must be a positive integer.
  - theta=<theta-file>: File for document distribution over topics. Each line contains a document's topic distribution. Topic weights are separated by space.
  - output_topic=<topic-file>: File for showing topics.
  - topic_count=<topic-count-file>: File for document-topic counts.
  - top_word=<num-top-word>: Number of words to give when showing topics (default: 10). Must be a positive integer.
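The corpus line format described above can be checked with a few lines of Java. This is a minimal sketch, not code from YWWTools: the class name IndexedDocLine and its parse method are hypothetical, and the check that <doc-len> equals the sum of the frequencies simply follows the statement that <doc-len> is the total number of tokens.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of YWWTools) that parses one indexed-corpus line. */
final class IndexedDocLine {
    /** Parses "<doc-len> <word-type>:<frequency> ..." into a word-index-to-frequency map. */
    static Map<Integer, Integer> parse(String line) {
        String[] fields = line.trim().split("\\s+");
        int docLen = Integer.parseInt(fields[0]);
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        int total = 0;
        for (int k = 1; k < fields.length; k++) {
            String[] pair = fields[k].split(":");
            if (pair.length != 2) {
                throw new IllegalArgumentException("Malformed entry: " + fields[k]);
            }
            int wordIndex = Integer.parseInt(pair[0]); // 0-based line number in the vocabulary file
            int frequency = Integer.parseInt(pair[1]);
            if (wordIndex < 0 || frequency < 0) {
                throw new IllegalArgumentException("Negative value in entry: " + fields[k]);
            }
            counts.merge(wordIndex, frequency, Integer::sum);
            total += frequency;
        }
        // <doc-len> is described as the total number of tokens, so it should match the sum.
        if (total != docLen) {
            throw new IllegalArgumentException(
                "doc-len " + docLen + " does not match token total " + total);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A 5-token document: word 0 twice, word 3 once, word 7 twice.
        System.out.println(parse("5 0:2 3:1 7:2")); // prints {0=2, 3=1, 7=2}
    }
}
```

Running a check like this over a corpus file before training can catch formatting mistakes early; whether the tool itself performs the same validation is not stated here.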
tool=lda
model=rtm
vocab=<vocab-file>
corpus=<corpus-file>
trained_model=<model-file>
rtm_train_graph=<rtm-train-graph-file>
- Implementation of Chang and Blei (2010).
- Jointly models topics and document links.
- Extends LDA.
- Semi-optional arguments
  - rtm_train_graph=<rtm-train-graph-file> [optional in test]: Link file for RTM to train. Each line contains an edge in the format node-1 \t node-2 \t weight. Node numbers start from 0. weight must be a non-negative integer, here either 0 or 1, and is optional. Its default value is 1 if not specified.
  - rtm_test_graph=<rtm-test-graph-file> [optional in training]: Link file for RTM to evaluate. Can be the same as the RTM train graph. The format is the same as that of <rtm-train-graph-file>.
- Optional arguments
  - nu=<nu-value>: Variance of normal priors for weight vectors/matrices in RTM and its extensions (default: 1.0). Must be a positive real number.
  - plr_interval=<compute-PLR-interval>: Interval of computing predictive link rank (default: 20). Must be a positive integer.
  - neg=true: Sample negative links (default: false).
  - neg_ratio=<neg-ratio>: The ratio of the number of negative links to the number of positive links (default: 1.0). Must be a positive real number.
  - pred=<pred-file>: Predicted document link probability matrix file.
  - reg=<reg-file>: Doc-doc regression value file.
  - directed=true: Set all edges directed (default: false).
Lex-WSB-RTM: RTM with Lexical Weights and Weighted Stochastic Block Priors
tool=lda
model=lex-wsb-rtm
vocab=<vocab-file>
corpus=<corpus-file>
trained_model=<model-file>
rtm_train_graph=<rtm-train-graph-file>
- Extends RTM.
- Optional arguments
  - wsbm_graph=<wsbm-graph-file>: Link file for WSBM to find blocks. See WSBM for details.
  - alpha_prime=<alpha-prime-value>: Parameter of Dirichlet prior of block distribution over topics (default: 1.0). Must be a positive real number.
  - a=<a-value>: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
  - b=<b-value>: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
  - gamma=<gamma-value>: Parameter of Dirichlet prior for block distribution (default: 1.0). Must be a positive real number.
  - blocks=<num-blocks>: Number of blocks (default: 10). Must be a positive integer.
  - output_wsbm=<wsbm-output-file>: File for WSBM-identified blocks. See WSBM for details.
  - block_feature=true: Include block features in link prediction (default: false).
Lex-WSB-Med-RTM: Lex-WSB-RTM with Hinge Loss
tool=lda
model=lex-wsb-med-rtm
vocab=<vocab-file>
corpus=<corpus-file>
trained_model=<model-file>
rtm_train_graph=<rtm-train-graph-file>
- Implementation of Yang et al. (2016).
- See Zhu et al. (2012) and Zhu et al. (2014) for hinge loss.
- Extends Lex-WSB-RTM.
- Link weight is either 1 or -1.
- Optional arguments
  - c=<c-value>: Regularization parameter in hinge loss (default: 1.0). Must be a positive real number.
SLDA: Supervised LDA
tool=lda
model=slda
vocab=<vocab-file>
corpus=<corpus-file>
trained_model=<model-file>
label=<label-file>
- Implementation of McAuliffe and Blei (2008).
- Jointly models topics and document labels. Supports multi-class classification.
- Extends LDA.
- Semi-optional arguments
  - label=<label-file> [optional in test]: Label file. Each line contains the corresponding document's numeric label. If a document label is not available, leave the corresponding line empty.
- Optional arguments
  - sigma=<sigma-value>: Variance for the Gaussian generation of response variable in SLDA (default: 1.0). Must be a positive real number.
  - nu=<nu-value>: Variance of normal priors for weight vectors in SLDA and its extensions (default: 1.0). Must be a positive real number.
  - pred=<pred-file>: Predicted label file.
  - reg=<reg-file>: Regression value file.
BS-LDA: Binary SLDA
tool=lda
model=bs-lda
vocab=<vocab-file>
corpus=<corpus-file>
trained_model=<model-file>
label=<label-file>
- For binary classification only.
- Extends SLDA.
- Label is either 1 or 0.
Lex-WSB-BS-LDA: BS-LDA with Lexical Weights and Weighted Stochastic Block Priors
tool=lda
model=lex-wsb-bs-lda
vocab=<vocab-file>
corpus=<corpus-file>
trained_model=<model-file>
label=<label-file>
- Extends BS-LDA.
- Optional arguments
  - wsbm_graph=<wsbm-graph-file>: Link file for WSBM to find blocks. See WSBM for details.
  - alpha_prime=<alpha-prime-value>: Parameter of Dirichlet prior of block distribution over topics (default: 1.0). Must be a positive real number.
  - a=<a-value>: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
  - b=<b-value>: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
  - gamma=<gamma-value>: Parameter of Dirichlet prior for block distribution (default: 1.0). Must be a positive real number.
  - blocks=<num-blocks>: Number of blocks (default: 10). Must be a positive integer.
  - directed=true: Set all edges directed (default: false).
  - output_wsbm=<wsbm-output-file>: File for WSBM-identified blocks. See WSBM for details.
Lex-WSB-Med-LDA: Lex-WSB-BS-LDA with Hinge Loss
tool=lda
model=lex-wsb-med-lda
vocab=<vocab-file>
corpus=<corpus-file>
trained_model=<model-file>
label=<label-file>
- See Zhu et al. (2012) and Zhu et al. (2014) for hinge loss.
- Extends Lex-WSB-BS-LDA.
- Label is either 1 or -1.
- Optional arguments
  - c=<c-value>: Regularization parameter in hinge loss (default: 1.0). Must be a positive real number.
BP-LDA: LDA with Block Priors
tool=lda
model=bp-lda
vocab=<vocab-file>
corpus=<corpus-file>
trained_model=<model-file>
block_graph=<block-graph-file>
- Use priors from pre-computed blocks.
- Extends LDA.
- Semi-optional arguments
  - block_graph=<block-graph-file> [optional in test]: Pre-computed block file. Each line contains a block and consists of one or more documents denoted by document numbers. Document numbers are separated by space.
- Optional arguments
  - alpha_prime=<alpha-prime-value>: Parameter of Dirichlet prior of block distribution over topics (default: 1.0). Must be a positive real number.
ST-LDA: Single Topic LDA
tool=lda
model=st-lda
vocab=<vocab-file>
corpus=<corpus-file>
trained_model=<model-file>
short_corpus=<short-corpus-file>
- Implementation of Hong et al. (2016).
- Each document can only be assigned to one topic.
- Extends LDA.
- Semi-optional arguments
  - short_corpus=<short-corpus-file> [at least one of short_corpus and corpus should be specified]: Short corpus file.
- Optional arguments
  - short_theta=<short-theta-file>: Short documents' background topic distribution file.
  - short_topic_assign=<short-topic-assign-file>: Short documents' topic assignment file.
tool=lda
model=wsb-tm
vocab=<vocab-file>
corpus=<corpus-file>
trained_model=<model-file>
wsbm_graph=<wsbm-graph-file>
- Use priors from WSBM-computed blocks.
- Extends LDA.
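- Semi-optional arguments
  - wsbm_graph=<wsbm-graph-file> [optional in test]: Link file for WSBM to find blocks. See WSBM for details.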
- Optional arguments
  - alpha_prime=<alpha-prime-value>: Parameter of Dirichlet prior of block distribution over topics (default: 1.0). Must be a positive real number.
  - a=<a-value>: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
  - b=<b-value>: Parameter of Gamma prior for block link rates (default: 1.0). Must be a positive real number.
  - gamma=<gamma-value>: Parameter of Dirichlet prior for block distribution (default: 1.0). Must be a positive real number.
  - blocks=<num-blocks>: Number of blocks (default: 10). Must be a positive integer.
  - directed=true: Set all edges directed (default: false).
  - output_wsbm=<wsbm-output-file>: File for WSBM-identified blocks. See WSBM for details.
tool=tlda
vocab=<vocab-file>
tree=<tree-prior-file>
corpus=<corpus-file>
trained_model=<model-file>
- Implementation of tree LDA (Boyd-Graber et al., 2007).
- Required arguments
  - <vocab-file>: Vocabulary file. Each line contains a unique word.
  - <tree-prior-file>: Tree prior file. Generated by Tree Builder.
  - <corpus-file>: Corpus file in which documents are represented by word indexes and frequencies. Each line contains a document in the following format:
    <doc-len> <word-type-1>:<frequency-1> <word-type-2>:<frequency-2> ... <word-type-n>:<frequency-n>
    <doc-len> is the total number of tokens in this document. <word-type-i> is the index of a word in <vocab-file>, starting from 0. Words with zero frequency can be omitted.
  - <model-file>: Trained model file. Read and written by the program.
- Optional arguments
  - test=true: Use the model for test (default: false).
  - verbose=true: Print log to console (default: true).
  - alpha=<alpha-value>: Parameter of Dirichlet prior of document distribution over topics (default: 0.01). Must be a positive real number.
  - beta=<beta-value>: Parameter of Dirichlet prior of topic distribution over words (default: 0.01). Must be a positive real number.
  - topics=<num-topics>: Number of topics (default: 10). Must be a positive integer.
  - iters=<num-iters>: Number of iterations (default: 100). Must be a positive integer.
  - update=false: Update alpha while sampling (default: false).
  - update_interval=<update-interval>: Interval of updating alpha (default: 10). Must be a positive integer.
  - theta=<theta-file>: File for document distribution over topics. Each line contains a document's topic distribution. Topic weights are separated by space.
  - output_topic=<topic-file>: File for showing topics.
  - topic_count=<topic-count-file>: File for document-topic counts.
  - top_word=<num-top-word>: Number of words to give when showing topics (default: 10). Must be a positive integer.
tool=mtm
num_langs=<num-languages>
dict=<dict-file>
vocab=<vocab-files>
corpus=<corpus-files>
trained_model=<model-file>
- Implementation of Multilingual Topic Model (Yang et al., 2019).
- Required arguments
  - <num-languages>: Number of languages. Must be a positive integer greater than 1.
  - <dict-file>: Dictionary file. Each line contains a word translation pair, represented by four elements separated by tab (\t): language ID of the first word, first word, language ID of the second word, second word.
  - <vocab-files>: Vocabulary files. One file for each language. File names are separated by comma (,). Each line contains a unique word.
  - <corpus-files>: Corpus files in which documents are represented by word indexes and frequencies. One file for each language. File names are separated by comma (,). Each line contains a document in the following format:
    <doc-len> <word-type-1>:<frequency-1> <word-type-2>:<frequency-2> ... <word-type-n>:<frequency-n>
    <doc-len> is the total number of tokens in this document. <word-type-i> is the index of a word in the corresponding language's <vocab-file>, starting from 0. Words with zero frequency can be omitted.
  - <model-file>: Trained model file. Read and written by the program.
- Optional arguments
  - test=true: Use the model for test (default: false).
  - verbose=true: Print log to console (default: true).
  - alpha=<alpha-values>: Parameter of Dirichlet prior of document distribution over topics (default: 0.01). One value for each language. Values are separated by comma (,). Must be positive real numbers.
  - beta=<beta-values>: Parameter of Dirichlet prior of topic distribution over words (default: 0.01). One value for each language. Values are separated by comma (,). Must be positive real numbers.
  - topics=<num-topics>: Number of topics (default: 10). One value for each language. Values are separated by comma (,). Must be positive integers.
  - iters=<num-iters>: Number of iterations (default: 100). Must be a positive integer.
  - update=false: Update alpha while sampling (default: false).
  - update_interval=<update-interval>: Interval of updating alpha (default: 10). Must be a positive integer.
  - theta=<theta-files>: Files for document distribution over topics. One file for each language. File names are separated by comma (,). Each line contains a document's topic distribution. Topic weights are separated by space.
  - rho=<rho-file>: File for topic transformation matrices. Assuming there are $N$ languages, the file contains $N(N-1)$ matrices. Each matrix starts with a line containing the string Rho[i][j], where i and j indicate two languages. The following $K_i$ rows contain the topic transformation matrix from language i to language j, and each row has $K_j$ values separated by spaces, where $K_i$ and $K_j$ are the numbers of topics in languages i and j respectively.
  - output_topic=<topic-file>: File for showing topics.
  - topic_count=<topic-count-file>: Files for document-topic counts. One file for each language. File names are separated by comma (,).
  - top_word=<num-top-word>: Number of words to give when showing topics (default: 10). Must be a positive integer.
  - reg=<regularization-option>: Regularization option (default: 0). 0 for no regularization, 1 for L1 norm, 2 for L2 norm, 3 for entropy, 4 for identity matrix.
  - lambda=<lambda-value>: The regularization coefficient (default: 0.0). Only effective when reg is not 0.
  - tfidf=true: Use TF-IDF weights as word translation pairs' weights (default: false).
  - word_tf_threshold=<word-term-frequency-threshold>: Ignore a word translation pair if either word's term frequency is equal to or lower than the given threshold (default: 0). One value for each language. Values are separated by comma (,). Must be non-negative integers.
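The layout of the rho file described above (a Rho[i][j] header line followed by $K_i$ rows of $K_j$ values) can be read back with a short parser. The following is only a sketch under stated assumptions: the class RhoFileSketch is hypothetical and not part of YWWTools, the language identifiers in the header are assumed to be decimal integers, and values are assumed to be separated by whitespace.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical reader (not part of YWWTools) for the rho file layout described above. */
final class RhoFileSketch {
    // Assumption: language identifiers in the header are decimal integers.
    private static final Pattern HEADER = Pattern.compile("Rho\\[(\\d+)\\]\\[(\\d+)\\]");

    /** Maps "i->j" to the matrix whose rows follow the Rho[i][j] header line. */
    static Map<String, double[][]> parse(List<String> lines) {
        Map<String, List<double[]>> rowsByPair = new LinkedHashMap<>();
        List<double[]> current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // tolerate blank lines
            }
            Matcher header = HEADER.matcher(line);
            if (header.matches()) {
                current = new ArrayList<>();
                rowsByPair.put(header.group(1) + "->" + header.group(2), current);
            } else {
                if (current == null) {
                    throw new IllegalArgumentException("Matrix row before any Rho[i][j] header");
                }
                String[] fields = line.split("\\s+");
                double[] row = new double[fields.length];
                for (int k = 0; k < fields.length; k++) {
                    row[k] = Double.parseDouble(fields[k]);
                }
                current.add(row);
            }
        }
        Map<String, double[][]> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<double[]>> entry : rowsByPair.entrySet()) {
            result.put(entry.getKey(), entry.getValue().toArray(new double[0][]));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two languages with 2 and 3 topics: N(N-1) = 2 matrices, of sizes 2x3 and 3x2.
        List<String> lines = List.of(
            "Rho[0][1]", "0.5 0.5 0.0", "0.1 0.2 0.7",
            "Rho[1][0]", "1.0 0.0", "0.0 1.0", "0.5 0.5");
        Map<String, double[][]> rho = parse(lines);
        System.out.println(rho.keySet());             // prints [0->1, 1->0]
        System.out.println(rho.get("1->0").length);   // prints 3
    }
}
```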
tool=wsbm
nodes=<num-nodes>
blocks=<num-blocks>
graph=<graph-file>
output=<output-file>
- Implementation of Aicher et al. (2014).
- Find latent blocks in a network, such that nodes in the same block are densely connected and nodes in different blocks are sparsely connected.
- Required arguments
  - <num-nodes>: Number of nodes in the graph. Must be a positive integer.
  - <num-blocks>: Number of blocks. Must be a positive integer.
  - <graph-file>: Graph file. Each line contains an edge in the format node-1 \t node-2 \t weight. Node numbers start from 0. weight must be a non-negative integer and is optional. Its default value is 1 if not specified.
  - <output-file>: Result file. The i-th line contains the block assignment of the i-th node.
- Optional arguments
  - directed=true: Set the edges as directed (default: false).
  - a=<a-value>: Parameter for edge rates' Gamma prior (default: 1.0). Must be a positive real number.
  - b=<b-value>: Parameter for edge rates' Gamma prior (default: 1.0). Must be a positive real number.
  - gamma=<gamma-value>: Parameter for block distribution's Dirichlet prior (default: 1.0). Must be a positive real number.
  - iters=<num-iters>: Number of iterations (default: 100). Must be a positive integer.
  - verbose=true: Print log to console (default: true).
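To make the graph file format concrete, here is a minimal sketch that parses one edge line, applying the default weight of 1 when the third field is absent. The class EdgeLine is hypothetical and is not taken from YWWTools; it only illustrates the format described above.

```java
/** Hypothetical value class (not part of YWWTools) for one line of a graph file. */
final class EdgeLine {
    final int source;
    final int target;
    final int weight;

    EdgeLine(int source, int target, int weight) {
        this.source = source;
        this.target = target;
        this.weight = weight;
    }

    /** Parses "node-1 \t node-2 [\t weight]"; the weight defaults to 1 when omitted. */
    static EdgeLine parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 2 || fields.length > 3) {
            throw new IllegalArgumentException("Expected 2 or 3 tab-separated fields: " + line);
        }
        int source = Integer.parseInt(fields[0].trim());
        int target = Integer.parseInt(fields[1].trim());
        int weight = fields.length == 3 ? Integer.parseInt(fields[2].trim()) : 1;
        // Node numbers start from 0 and weights are non-negative integers.
        if (source < 0 || target < 0 || weight < 0) {
            throw new IllegalArgumentException("Negative value in line: " + line);
        }
        return new EdgeLine(source, target, weight);
    }

    public static void main(String[] args) {
        EdgeLine unweighted = parse("0\t3");
        EdgeLine weighted = parse("2\t5\t4");
        System.out.println(unweighted.weight); // prints 1
        System.out.println(weighted.weight);   // prints 4
    }
}
```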
tool=scc
nodes=<num-nodes>
graph=<graph-file>
output=<output-file>
- New implementation.
- Find strongly connected components in an undirected graph. In each component, every node is reachable from every other node in the same component.
- Arguments
  - <num-nodes>: Number of nodes in the graph. Must be a positive integer.
  - <graph-file>: Graph file. Each line contains an edge in the format node-1 \t node-2. Node numbers start from 0.
  - <output-file>: Result file. Each line contains a strongly connected component and consists of one or more nodes denoted by node numbers. Node numbers are separated by space.
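Because the graph is described as undirected, mutual reachability reduces to ordinary connectivity, so the expected output can be illustrated with a small union-find sketch. This is not the tool's implementation (which is not shown here); the class ComponentsSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical union-find sketch (not the YWWTools implementation). */
final class ComponentsSketch {
    /** Returns the representative of x's set, shortening the path as it goes. */
    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        return x;
    }

    /** Groups nodes 0..numNodes-1 into components, treating every edge as undirected. */
    static List<List<Integer>> components(int numNodes, int[][] edges) {
        int[] parent = new int[numNodes];
        for (int i = 0; i < numNodes; i++) {
            parent[i] = i;
        }
        for (int[] edge : edges) {
            int a = find(parent, edge[0]);
            int b = find(parent, edge[1]);
            if (a != b) {
                parent[a] = b; // merge the two sets
            }
        }
        Map<Integer, List<Integer>> byRoot = new LinkedHashMap<>();
        for (int i = 0; i < numNodes; i++) {
            byRoot.computeIfAbsent(find(parent, i), r -> new ArrayList<>()).add(i);
        }
        return new ArrayList<>(byRoot.values());
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1}, {1, 2}, {3, 4}};
        // One component per output line, node numbers separated by space.
        for (List<Integer> component : components(5, edges)) {
            StringBuilder line = new StringBuilder();
            for (int node : component) {
                if (line.length() > 0) {
                    line.append(' ');
                }
                line.append(node);
            }
            System.out.println(line); // prints "0 1 2" and then "3 4"
        }
    }
}
```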
tool=stoplist
corpus=<corpus-file>
output=<output-file>
- New implementation.
- Only supports English, but can support other languages if dictionary is provided.
- Required arguments
  - <corpus-file>: Corpus file with stop words. Each line contains a document. Words are separated by space.
  - <output-file>: Corpus file without stop words. Each line contains a document. Words are separated by space.
- Optional arguments
  - dict=<dict-file>: Dictionary file name. Each line contains a stop word.
tool=lemmatizer
corpus=<corpus-file>
output=<output-file>
- A re-packaging of opennlp.tools.lemmatizer.SimpleLemmatizer.
- Only supports English, but can support other languages if a dictionary is provided.
- Required arguments
  - <corpus-file>: Unlemmatized corpus file. Each line contains an unlemmatized, tokenized, and POS-tagged document.
  - <output-file>: Lemmatized corpus file. Each line contains a lemmatized document. Words are separated by space.
- Optional arguments
  - dict=<dict-file>: Dictionary file name. Each line contains a rule in the format unlemmatized-word \t POS \t lemmatized-word.
tool=pos-tagger
corpus=<corpus-file>
output=<output-file>
- A re-packaging of opennlp.tools.postag.POSTaggerME (https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.postagger).
- Only supports English, but can support other languages if a model is provided.
- Required arguments
  - <corpus-file>: Untagged corpus file. Each line contains a tokenized, untagged document.
  - <output-file>: Tagged corpus file. Each line contains a tagged document. Each word is annotated as word_POS.
- Optional arguments
  - model=<model-file>: Model file name.
tool=stemmer
corpus=<corpus-file>
output=<output-file>
- A re-packaging of PorterStemmer (http://tartarus.org/~martin/PorterStemmer/index.html).
- Only supports English.
- Arguments
  - <corpus-file>: Unstemmed corpus file. Each line contains an unstemmed document. Words are separated by space.
  - <output-file>: Stemmed corpus file. Each line contains a stemmed document. Words are separated by space.
tool=tokenizer
corpus=<corpus-file>
output=<output-file>
- A re-packaging of opennlp.tools.tokenize.TokenizerME (https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.tokenizer).
- Only supports English, but can support other languages if a model is provided.
- Required arguments
  - <corpus-file>: Untokenized corpus file. Each line contains an untokenized document.
  - <output-file>: Tokenized corpus file. Each line contains a tokenized document.
- Optional arguments
  - model=<model-file>: Model file name.
tool=corpus-converter
get_vocab|to_index|to_word=true
word_corpus=<word-corpus-file>
index_corpus=<index-corpus-file>
vocab=<vocab-file>
- New implementation.
- Arguments
  - get_vocab, to_index, to_word: Only one of them should be true.
    - get_vocab: Collect the vocabulary from <word-corpus-file> and write it to <vocab-file>.
    - to_index: Convert a word corpus file <word-corpus-file> into an indexed corpus file <index-corpus-file> and write the vocabulary to <vocab-file>.
    - to_word: Convert an indexed corpus file <index-corpus-file> into a word corpus file <word-corpus-file> given the vocabulary file <vocab-file>.
  - <word-corpus-file>: Corpus file in which documents are represented by words. Each line contains a document. Words are separated by space.
  - <index-corpus-file>: Corpus file in which documents are represented by word indexes and frequencies. Not required when using get_vocab. Each line contains a document in the following format:
    <doc-len> <word-type-1>:<frequency-1> <word-type-2>:<frequency-2> ... <word-type-n>:<frequency-n>
    <doc-len> is the total number of tokens in this document. <word-type-i> is the index of a word in <vocab-file>, starting from 0. Words with zero frequency can be omitted.
  - <vocab-file>: Vocabulary file. Each line contains a unique word.
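The get_vocab and to_index conversions described above can be illustrated in memory with a short sketch. It is not the tool's code: the class CorpusIndexSketch is hypothetical, the vocabulary order (first occurrence) is an assumption since the order the tool uses is not stated here, and out-of-vocabulary words are simply skipped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory sketch (not the YWWTools converter). */
final class CorpusIndexSketch {
    /** Collects distinct words in first-seen order; a word's list position is its index. */
    static List<String> collectVocab(List<String> documents) {
        Map<String, Integer> seen = new LinkedHashMap<>();
        for (String doc : documents) {
            for (String word : doc.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    seen.putIfAbsent(word, seen.size());
                }
            }
        }
        return new ArrayList<>(seen.keySet());
    }

    /** Converts one space-separated document into "<doc-len> <index>:<frequency> ...". */
    static String toIndexedLine(String doc, List<String> vocab) {
        Map<Integer, Integer> counts = new TreeMap<>(); // sorted by word index
        int docLen = 0;
        for (String word : doc.trim().split("\\s+")) {
            int index = vocab.indexOf(word); // linear scan; fine for a sketch
            if (index < 0) {
                continue; // assumption: skip empty tokens and out-of-vocabulary words
            }
            counts.merge(index, 1, Integer::sum);
            docLen++;
        }
        StringBuilder line = new StringBuilder().append(docLen);
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            line.append(' ').append(entry.getKey()).append(':').append(entry.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        List<String> docs = List.of("the cat sat", "the cat saw the dog");
        List<String> vocab = collectVocab(docs);
        System.out.println(vocab);                             // prints [the, cat, sat, saw, dog]
        System.out.println(toIndexedLine(docs.get(1), vocab)); // prints 5 0:2 1:1 3:1 4:1
    }
}
```

Words with zero frequency never enter the map, which matches the note that they can be omitted from the indexed line.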
tool=tree-builder
vocab=<vocab-file>
score=<score-file>
tree=<tree-file>
- Implementation of Yang et al. (2017).
- Arguments
  - <vocab-file>: Vocabulary file. Each line contains a unique word.
  - <score-file>: Word association file. Assuming there are V words in <vocab-file>, there are V lines in <score-file>. Each line corresponds to a word in the vocabulary and contains V floating-point numbers, which denote the word's association scores with all other words.
  - <tree-file>: The tree prior file.
- Optional arguments
  - type=<tree-type>: Tree prior type (default: 1). 1 for two-level tree; 2 for hierarchical agglomerative clustering (HAC) tree; 3 for HAC tree with leaf duplication.
  - child=<num-child>: Number of child nodes per internal node for a two-level tree (default: 10).
  - thresh=<threshold>: The confidence threshold for HAC (default: 0.0).
To integrate my code into your project, please add YWWTools-v2.jar and everything in lib/ to your project's dependencies.
Here are examples for running some algorithms in this package. For more information, please look at the JavaDoc in doc/.
- Classes: yang.weiwei.lda.LDA and yang.weiwei.lda.LDAParam.
- Training code example
LDAParam param = new LDAParam("vocab_file_name"); // initialize a parameter object and set parameters as needed
LDA ldaTrain = new LDA(param); // initialize an LDA object
ldaTrain.readCorpus("corpus_file_name");
ldaTrain.initialize();
ldaTrain.sample(100); // set number of iterations as needed
ldaTrain.writeModel("model_file_name"); // optional, see test code example
ldaTrain.writeDocTopicDist("theta_file_name"); // optional, write document-topic distribution to file
ldaTrain.writeResult("topic_file_name", 10); // optional, write top 10 words of each topic to file
ldaTrain.writeDocTopicCounts("topic_count_file_name"); // optional, write document-topic counts to file
- Test code example
LDAParam param = new LDAParam("vocab_file_name");
LDA ldaTest = new LDA(ldaTrain, param); // initialize with pre-trained LDA object
// LDA ldaTest = new LDA("model_file_name", param); // or initialize with an LDA model in a file
ldaTest.readCorpus("corpus_file_name");
ldaTest.initialize();
ldaTest.sample(100); // set number of iterations as needed
ldaTest.writeDocTopicDist("theta_file_name"); // optional, write document-topic distribution to file
ldaTest.writeDocTopicCounts("topic_count_file_name"); // optional, write document-topic counts to file
- Class: yang.weiwei.lda.rtm.RTM.
- Extends LDA.
- Training code example
LDAParam param = new LDAParam("vocab_file_name");
RTM ldaTrain = new RTM(param);
ldaTrain.readCorpus("corpus_file_name");
ldaTrain.readGraph("train_graph_file_name", RTM.TRAIN_GRAPH); // read train graph
ldaTrain.readGraph("test_graph_file_name", RTM.TEST_GRAPH); // read test graph
ldaTrain.initialize();
ldaTrain.sample(100);
ldaTrain.writePred("pred_file_name"); // optional, write predicted document link probabilities to file
ldaTrain.writeRegValues("reg_value_file_name"); // optional, write doc-doc regression values to file
- Test code example
LDAParam param = new LDAParam("vocab_file_name");
RTM ldaTest = new RTM(ldaTrain, param);
// RTM ldaTest = new RTM("model_file_name", param);
ldaTest.readCorpus("corpus_file_name");
ldaTest.readGraph("train_graph_file_name", RTM.TRAIN_GRAPH); // optional
ldaTest.readGraph("test_graph_file_name", RTM.TEST_GRAPH);
ldaTest.initialize();
ldaTest.sample(100);
ldaTest.writePred("pred_file_name"); // optional, write predicted document link probabilities to file
ldaTest.writeRegValues("reg_value_file_name"); // optional, write doc-doc regression values to file
- Class: yang.weiwei.lda.rtm.lex_wsb_rtm.LexWSBRTM.
- Extends RTM.
- Training code example
LDAParam param = new LDAParam("vocab_file_name");
LexWSBRTM ldaTrain = new LexWSBRTM(param);
ldaTrain.readCorpus("corpus_file_name");
ldaTrain.readGraph("train_graph_file_name", RTM.TRAIN_GRAPH);
ldaTrain.readGraph("test_graph_file_name", RTM.TEST_GRAPH);
ldaTrain.readBlockGraph("wsbm_graph_file_name"); // optional, read graph for WSBM
ldaTrain.initialize();
ldaTrain.sample(100);
ldaTrain.writeBlocks("block_file_name"); // optional, write WSBM results to file
- Test code example
LDAParam param = new LDAParam("vocab_file_name");
LexWSBRTM ldaTest = new LexWSBRTM(ldaTrain, param);
// LexWSBRTM ldaTest = new LexWSBRTM("model_file_name", param);
ldaTest.readCorpus("corpus_file_name");
ldaTest.readGraph("train_graph_file_name", RTM.TRAIN_GRAPH); // optional
ldaTest.readGraph("test_graph_file_name", RTM.TEST_GRAPH);
ldaTest.readBlockGraph("wsbm_graph_file_name"); // optional
ldaTest.initialize();
ldaTest.sample(100);
ldaTest.writeBlocks("block_file_name"); // optional
- Class: yang.weiwei.lda.rtm.lex_wsb_med_rtm.LexWSBMedRTM.
- Extends Lex-WSB-RTM.
- Code examples are the same as those for Lex-WSB-RTM.
- Class: yang.weiwei.lda.slda.SLDA.
- Extends LDA.
- Training code example
LDAParam param = new LDAParam("vocab_file_name");
SLDA ldaTrain = new SLDA(param);
ldaTrain.readCorpus("corpus_file_name");
ldaTrain.readLabels("label_file_name"); // read label file
ldaTrain.initialize();
ldaTrain.sample(100);
ldaTrain.writePredLabels("pred_label_file_name"); // optional, write predicted labels
ldaTrain.writeRegValues("reg_value_file_name"); // optional, write regression values
- Test code example
LDAParam param = new LDAParam("vocab_file_name");
SLDA ldaTest = new SLDA(ldaTrain, param);
// SLDA ldaTest = new SLDA("model_file_name", param);
ldaTest.readCorpus("corpus_file_name");
ldaTest.readLabels("label_file_name"); // optional
ldaTest.initialize();
ldaTest.sample(100);
ldaTest.writePredLabels("pred_label_file_name"); // optional
ldaTest.writeRegValues("reg_value_file_name"); // optional
- Class: yang.weiwei.lda.slda.lex_wsb_bs_lda.LexWSBBSLDA.
- Extends BS-LDA.
- Training code example
LDAParam param = new LDAParam("vocab_file_name");
LexWSBBSLDA ldaTrain = new LexWSBBSLDA(param);
ldaTrain.readCorpus("corpus_file_name");
ldaTrain.readLabels("label_file_name");
ldaTrain.readBlockGraph("wsbm_graph_file_name"); // optional, read graph for WSBM
ldaTrain.initialize();
ldaTrain.sample(100);
ldaTrain.writeBlocks("block_file_name"); // optional, write WSBM results to file
- Test code example
LDAParam param = new LDAParam("vocab_file_name");
LexWSBBSLDA ldaTest = new LexWSBBSLDA(ldaTrain, param);
// LexWSBBSLDA ldaTest = new LexWSBBSLDA("model_file_name", param);
ldaTest.readCorpus("corpus_file_name");
ldaTest.readLabels("label_file_name"); // optional
ldaTest.readBlockGraph("wsbm_graph_file_name"); // optional
ldaTest.initialize();
ldaTest.sample(100);
ldaTest.writePredLabels("pred_label_file_name"); // optional
ldaTest.writeBlocks("block_file_name"); // optional
- Class: yang.weiwei.lda.slda.lex_wsb_med_lda.LexWSBMedLDA.
- Extends Lex-WSB-BS-LDA.
- Code examples are the same as those for Lex-WSB-BS-LDA.
- Class: yang.weiwei.lda.bp_lda.BPLDA.
- Extends LDA.
- Training code example
LDAParam param = new LDAParam("vocab_file_name");
BPLDA ldaTrain = new BPLDA(param);
ldaTrain.readCorpus("corpus_file_name");
ldaTrain.readBlocks("block_file_name"); // read block file
ldaTrain.initialize();
ldaTrain.sample(100);
- Test code example
LDAParam param = new LDAParam("vocab_file_name");
BPLDA ldaTest = new BPLDA(ldaTrain, param);
// BPLDA ldaTest = new BPLDA("model_file_name", param);
ldaTest.readCorpus("corpus_file_name");
ldaTest.readBlocks("block_file_name"); // optional
ldaTest.initialize();
ldaTest.sample(100);
- Class: yang.weiwei.lda.st_lda.STLDA.
- Extends LDA.
- Training code example
LDAParam param = new LDAParam("vocab_file_name");
STLDA ldaTrain = new STLDA(param);
ldaTrain.readCorpus("long_corpus_file_name");
ldaTrain.readShortCorpus("short_corpus_file_name");
ldaTrain.initialize();
ldaTrain.sample(100);
ldaTrain.writeShortDocTopicDist("short_theta_file_name"); // optional, write short documents' topic distribution to file
ldaTrain.writeShortDocTopicAssign("short_topic_assign_file_name"); // optional, write short documents' topic assignments to file
- Test code example
LDAParam param = new LDAParam("vocab_file_name");
STLDA ldaTest = new STLDA(ldaTrain, param);
// STLDA ldaTest = new STLDA("model_file_name", param);
ldaTest.readCorpus("long_corpus_file_name");
ldaTest.readShortCorpus("short_corpus_file_name");
ldaTest.initialize();
ldaTest.sample(100);
ldaTest.writeShortDocTopicDist("short_theta_file_name"); // optional
ldaTest.writeShortDocTopicAssign("short_topic_assign_file_name"); // optional
- Class: yang.weiwei.lda.wsb_tm.WSBTM.
- Extends LDA.
- Training code example
LDAParam param = new LDAParam("vocab_file_name");
WSBTM ldaTrain = new WSBTM(param);
ldaTrain.readCorpus("corpus_file_name");
ldaTrain.readGraph("wsbm_graph_file_name"); // read graph file
ldaTrain.initialize();
ldaTrain.sample(100);
- Test code example
LDAParam param = new LDAParam("vocab_file_name");
WSBTM ldaTest = new WSBTM(ldaTrain, param);
// WSBTM ldaTest = new WSBTM("model_file_name", param);
ldaTest.readCorpus("corpus_file_name");
ldaTest.readGraph("wsbm_graph_file_name"); // optional
ldaTest.initialize();
ldaTest.sample(100);
- Classes: yang.weiwei.tlda.TLDA and yang.weiwei.tlda.TLDAParam.
- Training code example
    TLDAParam param = new TLDAParam("vocab_file_name", "tree_prior_file_name"); // initialize a parameter object and set parameters as needed
    TLDA tldaTrain = new TLDA(param); // initialize a tLDA object
    tldaTrain.readCorpus("corpus_file_name");
    tldaTrain.initialize();
    tldaTrain.sample(100); // set number of iterations as needed
    tldaTrain.writeModel("model_file_name"); // optional, see test code example
    tldaTrain.writeDocTopicDist("theta_file_name"); // optional, write document-topic distribution to file
    tldaTrain.writeWordResult("topic_file_name", 10); // optional, write top 10 words of each topic to file
    tldaTrain.writeDocTopicCounts("topic_count_file_name"); // optional, write document-topic counts to file
- Test code example
    TLDAParam param = new TLDAParam("vocab_file_name", "tree_prior_file_name");
    TLDA tldaTest = new TLDA(tldaTrain, param); // initialize with pre-trained tLDA object
    // TLDA tldaTest = new TLDA("model_file_name", param); // or initialize with a tLDA model in a file
    tldaTest.readCorpus("corpus_file_name");
    tldaTest.initialize();
    tldaTest.sample(100); // set number of iterations as needed
    tldaTest.writeDocTopicDist("theta_file_name"); // optional, write document-topic distribution to file
    tldaTest.writeDocTopicCounts("topic_count_file_name"); // optional, write document-topic counts to file
- Classes: yang.weiwei.mtm.MTM and yang.weiwei.mtm.MTMParam.
- Training code example
    // vocabFileNames, corpusFileNames, thetaFileNames, and topicCountFileNames are arrays of file names
    MTMParam param = new MTMParam(vocabFileNames); // initialize a parameter object and set parameters as needed
    MTM mtmTrain = new MTM(param); // initialize an MTM object
    mtmTrain.readCorpus(corpusFileNames);
    mtmTrain.readWordAssociations("dict_file_name");
    mtmTrain.initialize();
    mtmTrain.sample(100); // set number of iterations as needed
    mtmTrain.writeModel("model_file_name"); // optional, see test code example
    mtmTrain.writeDocTopicDist(thetaFileNames); // optional, write document-topic distribution to files
    mtmTrain.writeResult("topic_file_name", 10); // optional, write top 10 words of each topic to file
    mtmTrain.writeDocTopicCounts(topicCountFileNames); // optional, write document-topic counts to files
    mtmTrain.writeTopicTransMatrices("rho_file_name"); // optional, write topic transformation matrices to file
- Test code example
    MTMParam param = new MTMParam(vocabFileNames);
    MTM mtmTest = new MTM(mtmTrain, param); // initialize with pre-trained MTM object
    // MTM mtmTest = new MTM("model_file_name", param); // or initialize with an MTM model in a file
    mtmTest.readCorpus(corpusFileNames);
    mtmTest.initialize();
    mtmTest.sample(100); // set number of iterations as needed
    mtmTest.writeDocTopicDist(thetaFileNames); // optional, write document-topic distribution to files
    mtmTest.writeDocTopicCounts(topicCountFileNames); // optional, write document-topic counts to files
- Classes: yang.weiwei.wsbm.WSBM and yang.weiwei.wsbm.WSBMParam.
- Code example
    WSBMParam param = new WSBMParam(); // initialize a parameter object and set parameters as needed
    WSBM wsbm = new WSBM(param); // initialize a WSBM object with parameters
    wsbm.readGraph("graph_file_name");
    wsbm.init();
    wsbm.sample(100); // set number of iterations as needed
    wsbm.printResults();
- Class: yang.weiwei.scc.SCC.
- Code example
    SCC scc = new SCC(10); // initialize with number of nodes
    scc.readGraph("graph_file_name");
    scc.cluster();
    scc.writeCluster("result_file_name");
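For readers unfamiliar with strongly connected components, the following is a minimal, self-contained sketch of Tarjan's algorithm on a small in-memory graph. It is only an illustration of the concept: the class `SccSketch` and its methods are hypothetical names chosen for this example, and nothing here is claimed to match how yang.weiwei.scc.SCC is implemented or what graph file format it reads.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Illustrative only: a generic Tarjan SCC sketch, not the yang.weiwei.scc.SCC implementation.
class SccSketch {
    private final List<List<Integer>> adj = new ArrayList<>();
    private final int[] index;       // discovery order of each node, -1 if unvisited
    private final int[] lowLink;     // smallest discovery index reachable from the node
    private final boolean[] onStack; // whether the node is still on the working stack
    private final Deque<Integer> stack = new ArrayDeque<>();
    private final List<List<Integer>> components = new ArrayList<>();
    private int counter = 0;

    SccSketch(int numNodes) {
        index = new int[numNodes];
        lowLink = new int[numNodes];
        onStack = new boolean[numNodes];
        Arrays.fill(index, -1);
        for (int i = 0; i < numNodes; i++) {
            adj.add(new ArrayList<Integer>());
        }
    }

    void addEdge(int from, int to) {
        adj.get(from).add(to);
    }

    // Runs a depth-first search from every unvisited node and returns the components found.
    List<List<Integer>> cluster() {
        for (int v = 0; v < index.length; v++) {
            if (index[v] == -1) {
                visit(v);
            }
        }
        return components;
    }

    private void visit(int v) {
        index[v] = lowLink[v] = counter++;
        stack.push(v);
        onStack[v] = true;
        for (int w : adj.get(v)) {
            if (index[w] == -1) {
                visit(w);
                lowLink[v] = Math.min(lowLink[v], lowLink[w]);
            } else if (onStack[w]) {
                lowLink[v] = Math.min(lowLink[v], index[w]);
            }
        }
        // v is the root of a component when nothing below it reaches an earlier node.
        if (lowLink[v] == index[v]) {
            List<Integer> component = new ArrayList<>();
            int w;
            do {
                w = stack.pop();
                onStack[w] = false;
                component.add(w);
            } while (w != v);
            components.add(component);
        }
    }

    public static void main(String[] args) {
        SccSketch scc = new SccSketch(5);
        scc.addEdge(0, 1);
        scc.addEdge(1, 2);
        scc.addEdge(2, 0); // cycle 0 -> 1 -> 2 -> 0
        scc.addEdge(2, 3);
        scc.addEdge(3, 4); // tail 3 -> 4, each node is its own component
        System.out.println(scc.cluster()); // prints [[4], [3], [2, 1, 0]]
    }
}
```

The recursive search keeps the sketch short; for very large graphs an iterative version would avoid deep call stacks.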
- Class: yang.weiwei.tlda.TreeBuilder.
- Code example
    TreeBuilder tb = new TreeBuilder();
    tb.build2LevelTree("score_file_name", "vocab_file_name", "tree_file_name", num_Child); // Build a two-level tree
    tb.hac("score_file_name", "vocab_file_name", "tree_file_name", threshold); // Build a tree with hierarchical agglomerative clustering (HAC)
    tb.hacWithLeafDup("score_file_name", "vocab_file_name", "tree_file_name", threshold); // Build a tree with HAC and leaf duplication
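As background on what hierarchical agglomerative clustering with a threshold does, here is a small generic sketch that repeatedly merges the two most similar word groups until no pair reaches the threshold. It is an assumption-laden illustration, not the TreeBuilder algorithm: the class `HacSketch`, the average-linkage choice, and the in-memory similarity matrix are all hypothetical, and the sketch returns only the final groups rather than a tree file.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: generic average-linkage HAC, not the yang.weiwei.tlda.TreeBuilder implementation.
class HacSketch {
    // Average pairwise similarity between the members of two clusters.
    static double averageLink(List<Integer> a, List<Integer> b, double[][] sim) {
        double sum = 0.0;
        for (int i : a) {
            for (int j : b) {
                sum += sim[i][j];
            }
        }
        return sum / (a.size() * b.size());
    }

    // Merges the most similar pair of clusters until no pair reaches the threshold.
    static List<List<Integer>> cluster(double[][] sim, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < sim.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single);
        }
        while (clusters.size() > 1) {
            int bestA = -1;
            int bestB = -1;
            double best = Double.NEGATIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double s = averageLink(clusters.get(a), clusters.get(b), sim);
                    if (s > best) {
                        best = s;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            if (best < threshold) {
                break; // remaining clusters are too dissimilar to merge
            }
            List<Integer> merged = clusters.remove(bestB); // bestB > bestA, so bestA stays valid
            clusters.get(bestA).addAll(merged);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Toy word-association scores: words 0 and 1 are related, as are words 2 and 3.
        double[][] sim = {
            {1.0, 0.9, 0.1, 0.1},
            {0.9, 1.0, 0.1, 0.1},
            {0.1, 0.1, 1.0, 0.8},
            {0.1, 0.1, 0.8, 1.0}
        };
        System.out.println(cluster(sim, 0.5)); // prints [[0, 1], [2, 3]]
    }
}
```

In a tree prior, each merge would correspond to an internal node; raising the threshold leaves more words in separate groups.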
- There are two basic ways to preprocess an English corpus for topic models:
    - tokenization -> stop words removal -> stemming
    - tokenization -> POS tagging -> lemmatization -> stop words removal
- The first way is quick but yields less readable words. The second takes more time but produces more readable words.
- Finally, you may want to remove words with low (document) frequency to accelerate topic modeling without hurting performance.
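The first preprocessing route and the final frequency filter can be sketched in a few lines of plain Java. This is a toy illustration under stated assumptions, not the YWW Tools implementation: the class `PreprocessSketch`, the six-word stop list, the regex tokenizer, and the suffix-stripping `toyStem` are all hypothetical stand-ins for the real Tokenizer, Stoplist, and Stemmer tools.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative only: none of these helpers are part of YWW Tools.
class PreprocessSketch {
    // Hypothetical toy stop list; a real one would be loaded from a dictionary file.
    static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("the", "a", "of", "and", "is", "are"));

    // Step 1: lowercase the text and split on runs of non-letter characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 3 stand-in: strips a trailing "s" from longer words. A real stemmer does far more.
    static String toyStem(String word) {
        return word.length() > 3 && word.endsWith("s")
            ? word.substring(0, word.length() - 1) : word;
    }

    // Final step: drop words that occur in fewer than minDocFreq documents.
    static List<List<String>> removeRareWords(List<List<String>> docs, int minDocFreq) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            Set<String> unique = new HashSet<String>(doc); // count each word once per document
            for (String word : unique) {
                docFreq.merge(word, 1, Integer::sum);
            }
        }
        List<List<String>> filtered = new ArrayList<>();
        for (List<String> doc : docs) {
            List<String> kept = new ArrayList<>();
            for (String word : doc) {
                if (docFreq.get(word) >= minDocFreq) {
                    kept.add(word);
                }
            }
            filtered.add(kept);
        }
        return filtered;
    }

    // tokenization -> stop words removal -> stemming -> low-frequency word removal
    static List<List<String>> preprocess(List<String> rawDocs, int minDocFreq) {
        List<List<String>> docs = new ArrayList<>();
        for (String raw : rawDocs) {
            List<String> words = new ArrayList<>();
            for (String token : tokenize(raw)) {
                if (!STOP_WORDS.contains(token)) {
                    words.add(toyStem(token));
                }
            }
            docs.add(words);
        }
        return removeRareWords(docs, minDocFreq);
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
            "The topics of the models", "A topic model is trained.", "Cats and dogs");
        System.out.println(preprocess(raw, 2)); // prints [[topic, model], [topic, model], []]
    }
}
```

Note that a document can end up empty after filtering, as the third one does here, so downstream code may need to handle empty documents.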
- If you use Tree Builder, please cite
    @InProceedings{Yang:Boyd-Graber:Resnik-2017,
        Title = {Adapting Topic Models using Lexical Associations with Tree Priors},
        Booktitle = {Empirical Methods in Natural Language Processing},
        Author = {Weiwei Yang and Jordan Boyd-Graber and Philip Resnik},
        Year = {2017},
        Location = {Copenhagen, Denmark},
    }
- If you use Lex-WSB-RTM (aka LBS-RTM), Lex-WSB-Med-RTM (aka LBH-RTM), Lex-WSB-BS-LDA, and/or Lex-WSB-Med-LDA, please cite
    @InProceedings{Yang:Boyd-Graber:Resnik-2016,
        Title = {A Discriminative Topic Model using Document Network Structure},
        Booktitle = {Association for Computational Linguistics},
        Author = {Weiwei Yang and Jordan Boyd-Graber and Philip Resnik},
        Year = {2016},
        Location = {Berlin, Germany},
    }
- If you use ST-LDA, please cite
    @InProceedings{Hong:Yang:Resnik:Frias-Martinez-2016,
        Title = {Uncovering Topic Dynamics of Social Media and News: The Case of Ferguson},
        Booktitle = {International Conference on Social Informatics},
        Author = {Lingzi Hong and Weiwei Yang and Philip Resnik and Vanessa Frias-Martinez},
        Year = {2016},
        Location = {Bellevue, WA, USA}
    }
- If you use MTM, please cite
    @InProceedings{Yang:Boyd-Graber:Resnik-2019,
        Title = {A Multilingual Topic Model for Learning Weighted Topic Links Across Corpora with Low Comparability},
        Booktitle = {Empirical Methods in Natural Language Processing},
        Author = {Weiwei Yang and Jordan Boyd-Graber and Philip Resnik},
        Year = {2019},
        Location = {Hong Kong, China},
    }
LDA: Latent Dirichlet Allocation
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research.
Jon D. McAuliffe and David M. Blei. 2008. Supervised topic models. In Proceedings of Advances in Neural Information Processing Systems.
Med-LDA: Max-margin LDA
Jun Zhu, Amr Ahmed, and Eric P. Xing. 2012. MedLDA: Maximum margin supervised topic models. Journal of Machine Learning Research.
Jun Zhu, Ning Chen, Hugh Perkins, and Bo Zhang. 2014. Gibbs max-margin topic models with data augmentation. Journal of Machine Learning Research.
RTM: Relational Topic Model
Jonathan Chang and David M. Blei. 2010. Hierarchical relational models for document networks. The Annals of Applied Statistics.
Lex-WSB-Med-RTM: RTM with WSB-computed Block Priors, Lexical Weights, and Hinge Loss
Weiwei Yang, Jordan Boyd-Graber, and Philip Resnik. 2016. A discriminative topic model using document network structure. In Proceedings of Association for Computational Linguistics.
Lingzi Hong, Weiwei Yang, Philip Resnik, and Vanessa Frias-Martinez. 2016. Uncovering topic dynamics of social media and news: The case of Ferguson. In Proceedings of International Conference on Social Informatics.
WSBM: Weighted Stochastic Block Model
Christopher Aicher, Abigail Z. Jacobs, and Aaron Clauset. 2014. Learning latent block structure in weighted networks. Journal of Complex Networks.
Jordan Boyd-Graber, David M. Blei, and Xiaojin Zhu. 2007. A topic model for word sense disambiguation. In Proceedings of Empirical Methods in Natural Language Processing.
Weiwei Yang, Jordan Boyd-Graber, and Philip Resnik. 2017. Adapting topic models using lexical associations with tree priors. In Proceedings of Empirical Methods in Natural Language Processing.
MTM: Multilingual Topic Model
Weiwei Yang, Jordan Boyd-Graber, and Philip Resnik. 2019. A multilingual topic model for learning weighted topic links across corpora with low comparability. In Proceedings of Empirical Methods in Natural Language Processing.