Biterm Topic Model - a minimal fork for real-world usage

From the original repository:

"Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrences patterns (e.g., biterms). (In constrast, LDA and PLSA are word-document co-occurrence topic models, since they model word-document co-occurrences.)

A biterm consists of two words co-occurring in the same context, for example, in the same short text window. Unlike LDA, which models word occurrences, BTM models the biterm occurrences in a corpus. In the generation procedure, a biterm is generated by drawing two words independently from the same topic. In other words, the distribution of a biterm b=(wi,wj) is defined as:

P(b) = \sum_z{P(wi|z)*P(wj|z)*P(z)}.

With a Gibbs sampling algorithm, we can learn the topics by estimating P(w|z) and P(z).

More details can be found in the following paper:

Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model For Short Text. WWW2013."

Motivation

This fork aims to provide interfaces to the author's original code base that are better suited to real-world applications, while making minimal modifications.

Usage / additions to the original project

Building

Run make in the repository's root.
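
For example (assuming make and a C++ compiler are installed; the clone directory name below is just the repository's default and may differ locally):

```
cd biterm-topic-model   # or wherever the repository was cloned
make
```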

Topic learning

Run script/train.py DOCUMENTS MODEL in the repository's root, where DOCUMENTS is a file with one document (consisting of space-separated tokens) per line and MODEL is the directory to create the model in. An example invocation is shown after the parameter list below.

Training parameters can be set as follows:

  • --num-topics K or -k K to set the number of topics to learn to K; this will default to K=20.
  • --alpha ALPHA or -a ALPHA to set the alpha parameter as given by the paper; this will default to ALPHA=K/50.
  • --beta BETA or -b BETA to set the beta parameter as given by the paper; this will default to BETA=5.
  • --num-iterations N_IT or -n N_IT to set the number of training iterations; this will default to N_IT=5.
  • --save-steps SAVE_STEPS or -s SAVE_STEPS to set the number of iterations after which the model is saved; this will default to SAVE_STEPS=500.
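
For example, the following hypothetical invocation (docs.txt and model are placeholder names) trains 50 topics for 1000 iterations, leaving the remaining parameters at their defaults:

```
script/train.py docs.txt model --num-topics 50 --num-iterations 1000
```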

After training, the directory MODEL will contain

  • a file vocab.txt with lines ID TOKEN that map the documents' tokens to integer IDs
  • a file topics.csv with tab-separated columns topic, prob_topic and top_words, where topic is a topic's ID z (\in [0..K-1]), prob_topic is P(z), and top_words is a comma-separated list of at most 10 tokens w with the highest values of P(w|z), i.e. the topic's most probable tokens (a small reading sketch follows this list)
  • a directory vectors/ that holds the actual model data, i.e. the values for P(z) and P(w|z) needed for topic inference
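
The snippet below is a small sketch of how topics.csv might be consumed downstream; it assumes the tab-separated layout described above and that the file carries no header row.

```python
import csv

# Sketch: print each learned topic with its prior probability and top words.
# Assumes MODEL/topics.csv uses the tab-separated columns described above
# (topic, prob_topic, top_words) and contains no header row.
with open("model/topics.csv", newline="") as f:
    for topic, prob_topic, top_words in csv.reader(f, delimiter="\t"):
        print(f"topic {topic} (P(z) = {float(prob_topic):.4f}): {top_words}")
```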

Topic inference, i.e. P(z|d)

This fork provides a Python class BTMInferrer in script/infer.py with an interface for fast topic inference on single documents; the interface can easily be implemented analogously in other programming languages.

Here, an instance i of BTMInferrer can be initialized with the model's directory (see section Topic learning). A single document's topic vector can then be inferred by calling i.infer(document), which will return a list of K floats representing the K-dimensional vector P(z|d).
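
A minimal usage sketch, assuming the code runs from the repository root (the import path is an assumption, since script/infer.py is not a package, and the document string is purely illustrative):

```python
import sys

sys.path.append("script")        # assumption: running from the repository root
from infer import BTMInferrer    # module name taken from script/infer.py

inferrer = BTMInferrer("model")                       # directory created by script/train.py
p_z_d = inferrer.infer("cheap flights to new york")   # hypothetical short document
print(len(p_z_d))                                     # K floats, i.e. the vector P(z|d)
best_topic = max(range(len(p_z_d)), key=p_z_d.__getitem__)
print(best_topic, p_z_d[best_topic])                  # most probable topic and its probability
```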

Notable changes from original repository

  • the existing Makefile was revised for efficiency and to separate build artifacts from the sources
  • the existing scripts were recreated to increase efficiency, adaptability and ease of use
  • the existing C++ code was formatted according to the LLVM Coding Standards, and dynamic inference (through stdin/stdout) was added while making minimal changes and retaining all previous functionality
  • the original project's sample data has been removed to decrease the repository's size (once GitHub prunes expired refs)