HLTA is a novel method for hierarchical topic detection. Specifically, it models document collections using a class of graphical models called hierarchical latent tree models (HLTMs). The variables at the bottom level of an HLTM are observed binary variables that represent the presence/absence of words in a document. The variables at other levels are binary latent variables, with those at the lowest latent level representing word co-occurrence patterns and those at higher levels representing co-occurrence of patterns at the level below. Each latent variable gives a soft partition of the documents, and document clusters in the partitions are interpreted as topics. Unlike LDA-based topic models, HLTMs do not refer to a document generation process and use word variables instead of token variables. They use a tree structure to model the relationships between topics and words, which is conducive to the discovery of meaningful topics and topic hierarchies.
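To make the bottom level of the model concrete, here is a minimal sketch of how documents can be turned into binary presence/absence word variables. It is an illustration only: the function name `to_binary_vectors`, the whitespace tokenization and the toy vocabulary are assumptions, not part of HLTA.

```python
def to_binary_vectors(documents, vocabulary):
    """Return one 0/1 vector per document: 1 if the word occurs, else 0."""
    vectors = []
    for text in documents:
        # Word counts are discarded; only presence or absence is kept.
        present = set(text.lower().split())
        vectors.append([1 if word in present else 0 for word in vocabulary])
    return vectors


docs = ["the lazy dog", "the quick brown fox", "dog dog dog"]
vocab = ["dog", "fox", "quick"]  # hypothetical vocabulary
vectors = to_binary_vectors(docs, vocab)
print(vectors)
```

Note that "dog dog dog" and "the lazy dog" get the same vector here, which reflects the use of word variables rather than token variables.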
A basic version of HLTA is proposed here: Hierarchical Latent Tree Analysis for Topic Detection. Tengfei Liu, Nevin L. Zhang and Peixian Chen. ECML/PKDD 2014: 256-272
An accelerated version of HLTA that uses Progressive EM is proposed here: Progressive EM for Latent Tree Models and Hierarchical Topic Detection. Peixian Chen, Nevin L. Zhang, Leonard K. M. Poon and Zhourong Chen. AAAI 2016
A full version of HLTA with a comprehensive description as well as several extensions can be found at:
Latent Tree Models for Hierarchical Topic Detection.
Peixian Chen, Nevin L. Zhang et al.
An IJCAI tutorial and demonstration can be found at: Multidimensional Text Clustering for Hierarchical Topic Detection (IJCAI 2016 Tutorial) by Nevin L. Zhang and Leonard K.M. Poon
The original HLTA Java code associated with the papers: Old HLTA Page
-
Download HLTA.jar and HLTA-deps.jar from the Release page.
-
An all-in-one command for hierarchical topic detection. It brings you through data conversion, model building, topic extraction and topic assignment.
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.HTD ./quickstart someName
On Windows, use a semicolon as the classpath separator instead:
java -cp HLTA.jar;HLTA-deps.jar tm.hlta.HTD ./quickstart someName
The output files include:
  - someName.sparse.txt: the converted data, generated if data conversion is necessary
  - someName.bif: HLTA model file
  - someName.html: HTML visualization
  - someName.nodes.js: a topic tree
  - someName.topics.js: a document catalog grouped by topics
  - lib: JavaScript and CSS files required by the main HTML file
  - fonts: fonts used by some CSS files
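The only difference between the Unix and Windows commands above is the classpath separator. If you launch HLTA from a script, the separator can be picked per platform. The following is a minimal sketch; the helper name `htd_command` is hypothetical, and only the class name and arguments shown above are taken from this README.

```python
import os


def htd_command(data, name, jars=("HLTA.jar", "HLTA-deps.jar"), sep=os.pathsep):
    """Build the all-in-one HTD command as an argument list."""
    # os.pathsep is ":" on Unix-like systems and ";" on Windows.
    classpath = sep.join(jars)
    return ["java", "-cp", classpath, "tm.hlta.HTD", data, name]


unix_cmd = htd_command("./quickstart", "someName", sep=":")
windows_cmd = htd_command("./quickstart", "someName", sep=";")
print(" ".join(unix_cmd))
```

The returned list can be passed to `subprocess.run`, which avoids shell quoting issues with the semicolon.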
-
You can also run it on a single text file:
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.HTD documents.txt someName
Your documents.txt:
One line is one single document. You can have as many sentences as you want.
The quick brown fox jump over the lazy dog. But the lazy dog is too big to be jumped over!
Lorem ipsum dolor sit amet, consectetur adipiscing elit
Maecenas in ligula at odio convallis consectetur eu ut erat
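Because each line of the input file is treated as one document, any line breaks inside a document have to be removed before writing it out. A minimal sketch of that preparation step follows; the function names and the sample texts are illustrative assumptions.

```python
def to_single_line(text):
    """Collapse all whitespace, including newlines, into single spaces."""
    return " ".join(text.split())


def format_documents(texts):
    """Return file content with exactly one document per line."""
    return "\n".join(to_single_line(t) for t in texts) + "\n"


raw = [
    "The quick brown fox\njumps over the lazy dog.",
    "Lorem ipsum\n\ndolor sit amet",
]
content = format_documents(raw)
print(content, end="")
```

Writing `content` to `documents.txt` gives a file with one line per original document.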
-
You may also run through the following subroutines to do data conversion, model building, topic extraction and topic assignment step by step.
-
Convert text files to bag-of-words representation with 1000 words and 1 concatenation:
java -cp HLTA.jar:HLTA-deps.jar tm.text.Convert myData ./source 1000 1
After conversion, you can find:
  - myData.sparse.txt: data in tuple format, i.e. lines of (docId, word) pairs
  - myData.dict.csv: the vocabulary list ('.dict-0.csv' is the list w/o concatenation, '.dict-1.csv' is after 1 concatenation, etc.)
You may put your files anywhere in ./source. It accepts txt and pdf.
./source/IAmPDF.pdf
./source/OneDocument.txt
./source/Folder1/Folder2/Folder3/HiddenSecret.txt
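The (docId, word) tuple format described above can be loaded back into per-document word sets for inspection. The sketch below assumes one comma-separated pair per line; the exact delimiter used by the converter is an assumption here, so check a generated file before relying on it.

```python
from collections import defaultdict


def read_pairs(lines, delimiter=","):
    """Group (docId, word) lines into a dict of docId -> set of words."""
    docs = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc_id, word = line.split(delimiter, 1)
        docs[doc_id.strip()].add(word.strip())
    return dict(docs)


# Hypothetical sample lines in the assumed "docId, word" layout.
sample = ["0, fox", "0, dog", "1, lorem", "", "1, ipsum", "0, fox"]
docs = read_pairs(sample)
print(docs)
```

Repeated pairs collapse into one entry, which matches the presence/absence view of words used by the model.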
-
Split into training set and testing set if needed: (v2.1+)
java -cp HLTA.jar:HLTA-deps.jar tm.text.Convert --testset-ratio 0.2 myData ./source 1000 1
- Build a model with a maximum of 50 EM steps (uses StepwiseEM):
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.HLTA myData.sparse.txt 50 myModel
The output files include:
  - myModel.bif: HLTA model file
-
Extract topics from the topic model:
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.ExtractTopicTree myTopicTree myModel.bif myData.sparse.txt
The output files include:
  - myTopicTree.html: a website
  - myTopicTree.nodes.js: a topic tree stored in JavaScript
  - myTopicTree.nodes.json: a topic tree stored as JSON
  - lib: JavaScript and CSS files required by the main HTML file
  - fonts: fonts used by some CSS files
-
You may use the "broadly defined topics" to speed up the process. Under this definition, more documents will be categorized into a topic. (ref paper section 8.2.1)
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.ExtractTopicTree --broad myTopicTree myModel.bif
-
Find out which documents belong to each topic (i.e. inference):
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.Doc2VecAssignment myModel.bif myData.sparse.txt myAssignment
The output files include:
  - myAssignment.topics.json: a document catalog grouped by topic
  - myAssignment.topics.js: a document catalog stored as a JavaScript variable
  - myAssignment.arff: doc2vec assignments in ARFF format
-
You may use the "broadly defined topics" to speed up the process. Under this definition, more documents will be categorized into a topic. (ref paper section 8.2.1)
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.Doc2VecAssignment --broad myModel.bif myData.sparse.txt topics
-
Evaluate by topic coherence
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.TopicCoherence myTopicTree.nodes.json myData.sparse.txt
-
Evaluate by topic compactness. (v2.3+)
java -Xmx4G -cp HLTA.jar:HLTA-deps.jar tm.hlta.TopicCompactness myTopicTree.nodes.json GoogleNews-vectors-negative300.bin
Download pre-trained word2vec model from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
-
Compute topic compactness in Python: install gensim (https://radimrehurek.com/gensim/) before using the Python code for computing the compactness scores from the AAAI17 paper (http://www.aaai.org/Conferences/AAAI/2017/PreliminaryPapers/12-Chen-Z-14201.pdf). A Word2Vec model pre-trained by Google is available at https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing. The model is described at https://code.google.com/archive/p/word2vec/ under the section "Pre-trained word and phrase vectors".
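As a rough illustration of what a compactness score measures, the sketch below averages the pairwise cosine similarity of the word vectors of a topic's words. This definition is an assumption made for illustration and may differ in detail from the scripts in this repository; the toy two-dimensional vectors stand in for a real Word2Vec model, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def compactness(words, vectors):
    """Average pairwise similarity of the topic words that have a vector."""
    known = [w for w in words if w in vectors]  # ignore out-of-vocabulary words
    pairs = list(combinations(known, 2))
    if not pairs:
        return None  # fewer than two known words: score undefined
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


# Toy vectors; a real run would look words up in a pre-trained model.
toy = {"cat": [1.0, 0.0], "dog": [1.0, 0.0], "car": [0.0, 1.0]}
score = compactness(["cat", "dog", "car", "unknown"], toy)
print(score)
```

With gensim installed, the toy dictionary could be replaced by a model loaded through `KeyedVectors.load_word2vec_format(path, binary=True)`, which supports the same membership test and word lookup.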
-
Evaluate by loglikelihood. You will need to create a testing set in advance. (v2.1+)
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.PerDocumentLoglikelihood myModel.bif myData.test.sparse.txt
As introduced in Subroutine 2 of Quick Example, HLTA can be trained with the default hyper-parameters by:
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.HLTA myData.sparse.txt 50 myModel
HLTA also supports tuning the hyper-parameters (v2.3+):
java -cp HLTA.jar:HLTA-deps.jar clustering.StepwiseEMHLTA $trainingdata $EmMaxSteps $EmNumRestarts $EM-threshold $UDtest-threshold $outputmodel $MaxIsland $MaxTop $GlobalsizeBatch $GlobalMaxEpochs $GlobalEMmaxsteps $IslandNotBridging $SampleSizeForstructureLearn $MaxCoreNumber $parallelIslandFindingLevel $CT-threshold
For example (v2.3+),
java -cp HLTA.jar:HLTA-deps.jar clustering.StepwiseEMHLTA myData.sparse.txt 50 3 0.01 3 myModel 15 30 500 10 100 1 10000 2 1
The parameters are:
- $trainingdata: the file name of training data
- $EmMaxSteps: max steps in EM (default: 50)
- $EmNumRestarts: number of restarts in EM (default: 3)
- $EM-threshold: threshold to control the stop of EM (default: 0.01)
- $UDtest-threshold: threshold to control whether the islands can pass UDtest (default: 3)
- $outputmodel: name of output model
- $MaxIsland: The maximum number of variables in one island (default: 15)
- $MaxTop: max variable numbers for top level (default: 30)
- $GlobalsizeBatch: batch size in global stepwise EM for parameter learning (default: 500)
- $GlobalMaxEpochs: max epoch number in global stepwise EM for parameter learning (default: 10)
- $GlobalEMmaxsteps: step numbers in global stepwise EM for parameter learning (default: 100)
- $IslandNotBridging: remove island bridging or not, the default value is 1 meaning to remove island bridging. (default: 1)
- $SampleSizeForstructureLearn: how many samples are used in structure learning. (default: 10000)
- $MaxCoreNumber: the number of parallel CPU processes. (default: 2) Choose a core number that suits the scale of your dataset. Further analysis of the balance between speed and performance can be found in the paper. This number should not exceed the number of CPU cores on your machine; otherwise it will slow HLTA down.
- $parallelIslandFindingLevel: when $MaxCoreNumber > 1, the highest level that uses parallel island finding. For example, $parallelIslandFindingLevel == 3 means level 1, level 2 and level 3 use parallel island finding, while other levels use serial island finding.
- $CT-threshold: threshold to control whether the island can pass correlation test, leave this empty to skip correlation test (default: empty, that is no correlation test)
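Since the parameters are positional, it is easy to put a value in the wrong slot. The sketch below assembles the argument list in the order given above, using the defaults listed for each parameter. The helper `stepwise_em_args` is hypothetical; no default is listed for $parallelIslandFindingLevel, so it must be passed explicitly, and $CT-threshold is omitted when not given.

```python
DEFAULTS = {
    "EmMaxSteps": 50, "EmNumRestarts": 3, "EM-threshold": 0.01,
    "UDtest-threshold": 3, "MaxIsland": 15, "MaxTop": 30,
    "GlobalsizeBatch": 500, "GlobalMaxEpochs": 10, "GlobalEMmaxsteps": 100,
    "IslandNotBridging": 1, "SampleSizeForstructureLearn": 10000,
    "MaxCoreNumber": 2,
}
# Positional order: four values before $outputmodel, the rest after it.
BEFORE_OUTPUT = ["EmMaxSteps", "EmNumRestarts", "EM-threshold", "UDtest-threshold"]
AFTER_OUTPUT = ["MaxIsland", "MaxTop", "GlobalsizeBatch", "GlobalMaxEpochs",
                "GlobalEMmaxsteps", "IslandNotBridging",
                "SampleSizeForstructureLearn", "MaxCoreNumber"]


def stepwise_em_args(training_data, output_model, parallel_level,
                     ct_threshold=None, overrides=None):
    """Return the positional arguments for clustering.StepwiseEMHLTA."""
    values = dict(DEFAULTS)
    values.update(overrides or {})
    unknown = set(values) - set(DEFAULTS)
    if unknown:
        raise KeyError("unknown parameters: %s" % sorted(unknown))
    args = [training_data]
    args += [str(values[k]) for k in BEFORE_OUTPUT]
    args.append(output_model)
    args += [str(values[k]) for k in AFTER_OUTPUT]
    args.append(str(parallel_level))
    if ct_threshold is not None:  # left out entirely to skip the correlation test
        args.append(str(ct_threshold))
    return args


args = stepwise_em_args("myData.sparse.txt", "myModel", parallel_level=1)
print(" ".join(args))
```

With the defaults, the result reproduces the argument string of the example command above; a single value can be changed with, for instance, `overrides={"MaxCoreNumber": 4}`.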
If you need to modify the source code and recompile HLTA, follow the steps below to build HLTA with sbt. Otherwise, skip this section.
-
Have Java 8 and sbt installed.
-
Git clone this repository
-
Change directory to the project directory. (e.g. user/git/hlta)
-
Run the following command to build the JAR files from source code:
sbt clean assembly assemblyPackageDependency && ./rename-deps.sh
The outputs are "HLTA.jar" and "HLTA-deps.jar", located in "target/scala-2.12/". They can be run with the instructions in "Quick Example".
- Current Maintainer: Chun Fai Leung (cfleungac@connect.ust.hk) (The Hong Kong University of Science and Technology)
- General questions: Leonard Poon (kmpoon@eduhk.hk) (The Education University of Hong Kong)
- PEM questions: Peixian Chen (pchenac@cse.ust.hk) (The Hong Kong University of Science and Technology)
- Prof. Nevin L. Zhang
- Peixian Chen
- Tao Chen
- Zhourong Chen
- Farhan Khawar
- Chun Fai Leung
- Tengfei Liu
- Leonard K.M. Poon
- Yi Wang