The state-of-the-art platform for topic modeling.
What is BigARTM?
BigARTM is a tool for topic modeling based on a novel technique called Additive Regularization of Topic Models (ARTM). This technique builds multi-objective models by adding weighted sums of regularizers to the optimization criterion. BigARTM is known to combine very different objectives well, including sparsing, smoothing, topic decorrelation and many others. Such combinations of regularizers significantly improve several quality measures at once, with almost no loss of perplexity.
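Concretely, ARTM maximizes the log-likelihood of the collection plus a weighted sum of regularizers (following the notation of the Vorontsov and Potapenko papers listed under References):

```latex
% Regularized log-likelihood, maximized over the model matrices
% \Phi = (\phi_{wt}) (word-in-topic) and \Theta = (\theta_{td}) (topic-in-document):
\sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \phi_{wt}\,\theta_{td}
  \;+\; \sum_{i} \tau_i R_i(\Phi, \Theta) \;\to\; \max_{\Phi,\Theta}
```

Each regularizer R_i encodes one objective (sparsity, smoothness, decorrelation, ...), and its weight tau_i plays the same role as the numeric coefficients passed to the command-line `--regularizer` option.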
Here are some examples of when you could use BigARTM:
- Build a model from an interactive IPython notebook. Demo
- Construct accurate text classifiers from a very small training set. Demo
- Evaluate coherence-based quality metrics. Demo
- Query for similar documents across different languages.
References
- Vorontsov K., Frei O., Apishev M., Romov P., Dudarenko M. BigARTM: Open Source Library for Regularized Multimodal Topic Modeling of Large Collections // Analysis of Images, Social Networks and Texts. 2015.
- Vorontsov K., Frei O., Apishev M., Romov P., Dudarenko M., Yanina A. Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large Collections // Proceedings of the 2015 Workshop on Topic Models: Post-Processing and Applications, October 19, 2015 - pp. 29-37.
- Vorontsov K., Potapenko A., Plavin A. Additive Regularization of Topic Models for Topic Selection and Sparse Factorization // Statistical Learning and Data Sciences. 2015 - pp. 193-202.
- Vorontsov K. V., Potapenko A. A. Additive Regularization of Topic Models // Machine Learning Journal, Special Issue “Data Analysis and Intelligent Optimization”, Springer, 2014.
Related Software Packages
- David Blei's List of Open Source topic modeling software
- MALLET: Java-based toolkit for language processing with topic modeling package
- Gensim: Python topic modeling library
- Vowpal Wabbit has an implementation of the Online LDA algorithm
How to Use
Installing
Download a binary release or build from source using CMake:
$ mkdir build && cd build
$ cmake ..
$ make install
Command-line interface
Check out the documentation for the bigartm command-line utility.
Examples:
- Basic model (20 topics, written to a CSV file, inferred in 10 passes)
bigartm.exe -d docword.kos.txt -v vocab.kos.txt --write-model-readable model.txt
--passes 10 --batch-size 50 --topics 20
- Basic model with fewer tokens (extreme values filtered out based on token frequency)
bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics 20 --write-model-readable model.txt
- Simple regularized model (increases sparsity to 60-70%)
bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics 20 --write-model-readable model.txt
--regularizer "0.05 SparsePhi" "0.05 SparseTheta"
- A more advanced regularized model, with 10 sparse objective topics and 2 smooth background topics
bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics obj:10;background:2 --write-model-readable model.txt
--regularizer "0.05 SparsePhi #obj"
--regularizer "0.05 SparseTheta #obj"
--regularizer "0.25 SmoothPhi #background"
--regularizer "0.25 SmoothTheta #background"
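The sparsing and smoothing weights above act through the ARTM M-step update, phi_wt proportional to max(n_wt + phi_wt * dR/dphi_wt, 0). The following toy NumPy sketch (an illustration written for this document, not BigARTM internals) shows how a uniform smooth/sparse regularizer R = tau * sum ln phi_wt shifts every count by tau, so a negative tau drives small entries of Phi to exact zeros:

```python
import numpy as np

# Toy counts n_wt (words x topics), as accumulated in the E-step.
n_wt = np.array([[5.0, 1.0],
                 [0.5, 4.0],
                 [0.2, 0.1]])

def m_step(n_wt, tau=0.0):
    """ARTM M-step with a uniform smoothing/sparsing regularizer:
    R = tau * sum ln phi  =>  phi * dR/dphi = tau, added to every cell.
    tau > 0 smooths; tau < 0 sparses (negative cells are clipped to zero)."""
    a = np.maximum(n_wt + tau, 0.0)
    return a / a.sum(axis=0, keepdims=True)  # normalize columns to distributions

phi_plain = m_step(n_wt)             # PLSA-style, no regularization
phi_sparse = m_step(n_wt, tau=-0.3)  # sparsing: the rare word's row becomes zero
```

BigARTM applies the same idea at scale, with per-regularizer, per-topic weights such as the `#obj` and `#background` selectors shown above.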
Interactive Python interface
Check out the documentation for the ARTM Python interface in English and in Russian
Refer to the tutorials for details on how to install and start using the Python interface.
# A minimal example using the artm Python package
import artm

# Convert the UCI bag-of-words "kos" collection into BigARTM batches
batch_vectorizer = artm.BatchVectorizer(data_format='bow_uci',
                                        collection_name='kos',
                                        target_folder='kos')

model = artm.ARTM(num_topics=15)
model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=5)
print(model.phi_)
Low-level API
Contributing
Refer to the Developer's Guide.
To report a bug, use the issue tracker. To ask a question, use our mailing list. Feel free to open a pull request.
License
BigARTM is released under the New BSD License, which allows unlimited redistribution for any purpose (including commercial use) as long as its copyright notices and the license's disclaimers of warranty are maintained.