Using stochastic block models for topic modeling

License: GNU General Public License v3.0 (GPL-3.0)

hSBM_Topicmodel

A tutorial for topic modeling with hierarchical stochastic block models using graph-tool.

Data

The corpus is saved in corpus.txt, where each line is a separate document with words separated by whitespace. Optionally, titles for the documents can be provided in titles.txt.
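
The file format described above could be read with a sketch like the following. It is not taken from the tutorial notebook; the function name `load_corpus` is hypothetical, and the assumption that titles.txt holds one title per line, in the same order as the documents, is ours.

```python
def load_corpus(corpus_path, titles_path=None):
    """Read one document per line and split each line into words on whitespace."""
    with open(corpus_path, encoding="utf-8") as f:
        texts = [line.split() for line in f]

    if titles_path is not None:
        # Assumption: one title per line, aligned with the lines of the corpus.
        with open(titles_path, encoding="utf-8") as f:
            titles = [line.strip() for line in f]
    else:
        # No titles file: fall back to the line index as a document label.
        titles = [str(i) for i in range(len(texts))]

    if len(titles) != len(texts):
        raise ValueError("titles and corpus must have the same number of lines")
    return texts, titles
```

Calling `load_corpus("corpus.txt", "titles.txt")` would return a list of word lists and a list of titles of the same length.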

Setup

Install graph-tool

We use the graph-tool package for finding topical structure in the word-document networks.

  • see the graph-tool installation instructions, where you will find packages for Linux, etc.
  • an alternative on Linux is to install graph-tool in a conda environment, see here

Get Jupyter notebook

To run the tutorial notebook, install Jupyter, e.g.

pip install jupyter

Get hSBM-TopicModel repository

To do topic modeling with stochastic block models, get the code from the repository:

git clone https://github.com/martingerlach/hSBM_Topicmodel.git

Run the code

Start jupyter notebooks

jupyter notebook

then select the 'TopSBM-tutorial'-notebook.

It will guide you through the different steps to do topic modeling with stochastic block models:

  • How to construct the word-document network from a corpus of text
  • How to fit the stochastic block model to the word-document network
  • How to extract the topics from the fitted model, e.g.
    • the most important words for each topic
    • the clustering of documents
    • the topic mixtures for each document
  • How to visualize the topical structure, in particular the hierarchy of topics
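
To make the first step concrete, here is a minimal pure-Python sketch of what a word-document network contains: a bipartite structure in which every edge links a document to a word, weighted by how often the word occurs in that document. The notebook builds this network with graph-tool; the function name `word_document_edges` and the edge representation used here are illustrative assumptions, not the repository's code.

```python
from collections import Counter


def word_document_edges(texts):
    """Count how often each word occurs in each document.

    Each key (doc_index, word) is one edge of the bipartite network;
    the value is the number of occurrences (the edge multiplicity).
    """
    edges = Counter()
    for doc_index, words in enumerate(texts):
        for word in words:
            edges[(doc_index, word)] += 1
    return edges


# Toy corpus: two documents given as lists of words.
texts = [["a", "b", "a"], ["b", "c"]]
edges = word_document_edges(texts)
print(edges[(0, "a")])  # "a" occurs twice in document 0, so this prints 2
```

The stochastic block model is then fitted to a network of this kind, grouping word nodes into topics and document nodes into clusters.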