Experiments for Sparse Parallel Training of Hierarchical Dirichlet Process Topic Models

This directory contains the experiments for Sparse Parallel Training of Hierarchical Dirichlet Process Topic Models by A. Terenin (@aterenin), M. Magnusson (@MansMeg), and L. Jonsson (@lejon), published in the Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Running

java -Xmx7g -jar PCPLDA-8.0.6.jar --run_cfg=data/cfg_ap.cfg
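
In the command above, -Xmx7g caps the JVM heap at 7 GB and --run_cfg selects the experiment configuration file. To run a different experiment, point --run_cfg at the corresponding file under data/. As a minimal sketch, assuming the remaining configuration files follow the same cfg_*.cfg naming (the glob pattern is an assumption, not part of the released scripts):

# Run every experiment configuration under data/ in sequence.
# Assumes configs are named cfg_*.cfg; adjust the glob if the names differ.
for cfg in data/cfg_*.cfg; do
  java -Xmx7g -jar PCPLDA-8.0.6.jar --run_cfg="$cfg"
done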

Code

Data

Caveats

  • Due to the use of parallelism, random number seeds are used for initialization only, not for sampling, so individual runs are not bit-for-bit reproducible.
  • CGCBIB and AP are preprocessed to keep only documents with at least 10 tokens; see the filtering sketch after this list.
  • All samplers are in PCPLDA-8.0.6.jar, except for the direct assignment reference implementation, which is in mallet-ilda.jar, and the subcluster split-merge implementation, which is copyrighted by its authors and must be obtained from them separately.
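
For reference, the 10-token filter in the caveat above can be reproduced with a sketch like the following, assuming a plain-text corpus with one whitespace-tokenized document per line; the file names are placeholders and the released preprocessing may differ:

# Keep only documents (lines) with at least 10 whitespace-separated tokens.
# Assumes one document per line; corpus.txt and corpus_filtered.txt are
# placeholder names.
awk 'NF >= 10' corpus.txt > corpus_filtered.txt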