
Single-topic LDA (DMM) with unsupervised clustering


stLDA-C

stLDA-C is a single-topic LDA model with clustering, developed to model Twitter posts and cluster users. This repository implements the method described in a forthcoming paper by Tierney, Bail, and Volfovsky.

The primary use case for this model is when you have many short texts (posts) from a large number of users, and you want to cluster both the texts and the users simultaneously. The model extends the basic LDA framework in Blei, Ng, and Jordan (2003). Each post has a single latent topic, and its words are a multinomial draw from that topic's distribution over the vocabulary. Users post about topics at different rates, modeled by a latent, Dirichlet-distributed random variable. Each user’s latent topic frequencies are drawn from a cluster-specific Dirichlet.
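The generative process described above can be sketched in a few lines of base R. This is only an illustration under assumed settings, not the repository's simulation code: every name here (`rdirichlet1`, `phi`, `alpha`, `theta`, `dtm`) and every size or parameter value is hypothetical.

```r
set.seed(1)

# Hypothetical sizes; none of these names or values come from the repository.
n_clusters <- 2; n_topics <- 3; n_words <- 20
n_users <- 10; posts_per_user <- 5; words_per_post <- 8

# Draw one sample from Dirichlet(alpha) by normalizing gamma variates.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

# Topic-word distributions: one row per topic.
phi <- t(replicate(n_topics, rdirichlet1(rep(0.1, n_words))))

# Cluster-specific Dirichlet parameters over topics: one row per cluster.
alpha <- rbind(c(5, 1, 1), c(1, 1, 5))

# Each user belongs to one cluster and draws topic rates from its Dirichlet.
user_cluster <- sample(n_clusters, n_users, replace = TRUE)
theta <- t(sapply(user_cluster, function(g) rdirichlet1(alpha[g, ])))

# Each post gets exactly one topic, drawn from its author's topic rates.
posts <- do.call(rbind, lapply(seq_len(n_users), function(u) {
  z <- sample(n_topics, posts_per_user, replace = TRUE, prob = theta[u, ])
  data.frame(user = u, topic = z)
}))

# Each post's words are a single multinomial draw from its topic's word distribution.
dtm <- t(sapply(posts$topic, function(k) rmultinom(1, words_per_post, phi[k, ])))
```

Here `dtm` is a post-by-word count matrix and `posts` records the author and the true topic of each row, which is the kind of input a sampler for this model would try to recover.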

The model makes two intuitive improvements to traditional topic models applied to short text. The first is the single-topic-per-tweet assumption. Because word co-occurrence is sparse in short documents, traditional topic models struggle to estimate meaningful topic distributions. Modeling strong dependence among words in the same tweet dramatically improves topic estimation, because every word in a tweet is used to infer the same latent topic's distribution over words. The second is unsupervised clustering of users. Learning the topics for users who tweet infrequently is difficult because of small sample sizes. In traditional hierarchical modeling, noisy user-level estimates are shrunk towards a grand mean. With the cluster estimation in our model, noisily estimated parameters are instead shrunk towards the average of the users they are most similar to. If one only observes a few tweets from a user about sports, for example, estimates of his or her topic distribution should be shrunk towards the typical topic selections of other users who talk about sports, rather than towards the average user who talks about a wide range of topics.
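The shrinkage argument can be made concrete with the standard Dirichlet-multinomial posterior mean: given a user's per-topic post counts and a Dirichlet prior, the posterior mean is the normalized sum of counts and prior parameters. The sketch below compares a prior centered on the average user with a prior for a sports-heavy cluster. The topic labels, counts, and prior values are invented for illustration, and `posterior_mean` is a hypothetical helper, not a function from this repository.

```r
# Posterior mean of a user's topic proportions under a Dirichlet prior,
# given how many of the user's posts were assigned to each topic.
posterior_mean <- function(counts, alpha) (counts + alpha) / sum(counts + alpha)

topics <- c("sports", "politics", "music")
counts <- c(3, 0, 0)         # three observed posts, all about sports

alpha_global <- c(2, 2, 2)   # hypothetical prior centered on the average user
alpha_sports <- c(4, 1, 1)   # hypothetical prior for a sports-heavy cluster

shrunk_global <- setNames(posterior_mean(counts, alpha_global), topics)
shrunk_sports <- setNames(posterior_mean(counts, alpha_sports), topics)

print(round(shrunk_global, 2))
print(round(shrunk_sports, 2))
```

With these illustrative numbers the estimated sports share is 5/9 (about 0.56) under the grand-mean prior and 7/9 (about 0.78) under the cluster prior, so the cluster prior pulls the sparse user towards similar users rather than towards the overall average.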

This code provides a collapsed Gibbs sampler to estimate each post’s topic, each user’s cluster, and the cluster-specific Dirichlet parameters. demo_code.R loads the scripts, simulates data, and runs the method on that data.
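To give a sense of what one step of a collapsed Gibbs sampler for a single-topic-per-post model can look like, here is a minimal sketch of resampling one post's topic. It is not the repository's implementation. It assumes a symmetric Dirichlet prior `beta` on topic-word distributions (an assumption not stated above), and the function names and arguments are hypothetical. The weight for each topic combines the author's other posts on that topic plus the cluster's Dirichlet parameter with the Dirichlet-multinomial likelihood of the post's words.

```r
# Log conditional weights for one post's topic, up to an additive constant.
# post_counts : length-V word counts for the post being resampled
# topic_word  : K x V word counts per topic, excluding this post
# user_topic  : length-K counts of the author's other posts per topic
# alpha       : length-K Dirichlet parameters of the author's cluster
# beta        : symmetric Dirichlet parameter on topic-word distributions (assumed)
topic_log_weights <- function(post_counts, topic_word, user_topic, alpha, beta) {
  V <- ncol(topic_word)
  n_post <- sum(post_counts)
  topic_totals <- rowSums(topic_word)
  # Dirichlet-multinomial log-likelihood of the post's words under each topic.
  with_post <- sweep(topic_word + beta, 2, post_counts, "+")
  word_term <- rowSums(lgamma(with_post) - lgamma(topic_word + beta))
  norm_term <- lgamma(topic_totals + V * beta + n_post) -
    lgamma(topic_totals + V * beta)
  log(user_topic + alpha) + word_term - norm_term
}

# Draw a topic index from unnormalized log weights.
sample_topic <- function(log_w) {
  p <- exp(log_w - max(log_w))  # subtract the max for numerical stability
  sample(length(p), 1, prob = p)
}

# Toy example: topic 1 has only used word 1, topic 2 has only used word 2.
topic_word <- rbind(c(10, 0, 0), c(0, 10, 0))
lw <- topic_log_weights(post_counts = c(2, 0, 0), topic_word = topic_word,
                        user_topic = c(1, 1), alpha = c(1, 1), beta = 0.1)
new_topic <- sample_topic(lw)
```

In the toy example the post repeats word 1, so `lw[1]` is much larger than `lw[2]` and the post is almost always assigned to topic 1. A full sampler would also update user clusters and the cluster-specific Dirichlet parameters, which this sketch does not attempt.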