/hlda

Gibbs sampler for the Hierarchical Latent Dirichlet Allocation topic model

Primary LanguageJupyter NotebookGNU General Public License v3.0GPL-3.0

Hierarchical Latent Dirichlet Allocation

Hierarchical Latent Dirichlet Allocation (hLDA) addresses the problem of learning topic hierarchies from data. The model relies on a non-parametric prior called the nested Chinese restaurant process, which allows for arbitrarily large branching factors and readily accommodates growing data collections. The hLDA model combines this prior with a likelihood that is based on a hierarchical variant of latent Dirichlet allocation.

Hierarchical Topic Models and the Nested Chinese Restaurant Process

The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies

Implementation

  • hlda/sampler.py is the Gibbs sampler for hLDA inference, based on the implementation from Mallet having a fixed depth on the nCRP tree.

Installation

  • Simply use pip install hlda to install the package.
  • An example notebook that infers the hierarchical topics on the BBC Insight corpus can be found in notebooks/bbc_test.ipynb.

ZW: Additional Conda Setup

From a basic anaconda environment (2020.02), a number of additional packages need to be installed for the notebook to work out of the box:

  • conda update -n base -c defaults conda
  • conda install jupyter matplotlib nltk wordcloud

Black:

Tomotopy:

  • conda install py-cpuinfo
  • pip install tomotopy

Also make sure the necessary NLTK packages are downloaded and accessible.