Automatic labeling of topic models.
The alogirithm is described in Automatic Labeling of Multinomial Topic Models
We model the abstracts of NIPS 2014
(NIPS abstracts from 2008 to 2014 is available under datasets/
).
Meanwhile, we contrain the labels to be tagged as NN,NN
or JJ,NN
and use the top 200 most informative labels.
>>> python label_topic.py --line_corpus_path datasets/nips-2014.dat --preprocessing wordlen tag --label_tags NN,NN JJ,NN --n_cand_labels 200
...
Topical words:
--------------------
Topic 0: model data framework clustering information distributions two number world propose noise real work small
Topic 1: learning algorithm time problem online regret information decision conditional new stochastic algorithms selection problems
Topic 2: algorithm algorithms problem results learning optimal show function class functions graph bounds based general
Topic 3: learning training networks data tasks features neural kernel performance classification model datasets feature deep
Topic 4: matrix method sparse convex problems methods dimensional problem rank analysis propose regression norm gradient
Topic 5: model models inference approach data linear based gaussian method methods process sampling structure time
Topical labels:
--------------------
Topic labels:
Topic 0: neural population, inference algorithm, likelihood estimator, stochastic optimization, matrix recovery, paper develop, empirical study, covariance matrix
Topic 1: bandit problem, near-optimal regret, function approximation, paper consider, general class, multi-armed bandit, value function, statistical learning
Topic 2: logarithmic factor, statistical learning, convergence rate, communication cost, other hand, main result, solution quality, function approximation
Topic 3: pascal voc, major challenge, natural language, paper introduce, object recognition, policy search, empirical study, classification accuracy
Topic 4: low-rank tensor, low-rank matrix, matrix recovery, coordinate descent, problem finding, direction method, statistical learning, risk minimization
Topic 5: inference algorithm, introduce novel, exponential family, probabilistic inference, neural population, value function, policy search, other hand
For example:
>>> python label_topic.py --line_corpus_path datasets/nips-2014.dat --preprocess wordlen tag --label_tags NN,NN
For more details:
>>> python label_topic.py --help
Please refer to label_topic.py
.
The current version goes through the following steps
- Preprocessing using nltk's
word_tokenize
,stem
andpos_tag
. - Candidate phrase detection using pointwise mutual information: POS tag constraint can be applied. For now, only bigrams are considered.
- Topic modeling using LDA.
- Candidate label ranking using the algorithm here.
- Better phrase detection thorugh better POS tagging
- Better ways to compute language models for labels to support
intra topical coverage
heuristic(which is now disabled) - Support for user defined candidate labels
- Faster PMI computation(using Cythong for example)
- More flexibity/option on preprocessing
- Leveraging knowledge base to refine the labels