bstewart/stm

[Question] Difference between STM and LDA

santoshbs opened this issue · 2 comments

I have been using STM to analyze tweets. The stm() and estimateEffect() functions have been extremely useful in understanding how characteristics of the Twitter user (e.g., gender, ethnicity, age, region) relate to the use of specific topics, while controlling for several covariates related to the Twitter handle.

One major difficulty I have is justifying the utility of STM over LDA. Much of my audience is familiar with LDA, but not with STM. One suggestion I have often received is to use one of the most common LDA packages, such as Mallet, and then run regressions with topic weight as the DV and all variables related to the Twitter handle as IVs.

It is in this regard that I wanted to seek your expertise. Is there any distinct advantage that STM's estimateEffect() offers over R's regression functions? As they suggest, could I not call stm() without specifying any covariates, and then use glm(family = 'binomial') and glmer() with topic weight as the DV?

I'm going to assume for this answer that you are only using prevalence covariates (content covariates make a big difference). Prevalence covariates in STM are mostly helpful when the documents are quite short; when the documents are longer, the data will overwhelm the prior and the covariates won't matter as much. STM also uses a logistic normal prior instead of LDA's Dirichlet, but honestly, for longer documents I don't think that makes a dramatic difference relative to collapsed Gibbs sampling (what Mallet uses) for LDA. Both yield a fairly rich posterior approximation that allows correlation in the posterior distribution.
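To make the prior difference concrete, here is a minimal base-R sketch (not code from the stm package) that draws topic proportions from a Dirichlet and from a logistic normal. The helper names `rdirichlet_sketch` and `rlogisticnormal_sketch`, the number of topics, and the covariance values are all illustrative assumptions. The point is only that Dirichlet components are always negatively correlated, while a logistic normal can encode other correlation patterns through its covariance matrix.

```r
set.seed(1)
n <- 5000  # number of simulated documents (arbitrary)

# Dirichlet draws: normalize independent gamma variables (hypothetical helper)
rdirichlet_sketch <- function(n, alpha) {
  k <- length(alpha)
  # byrow = TRUE so the recycled shape vector lines up with the columns
  g <- matrix(rgamma(n * k, shape = alpha), nrow = n, byrow = TRUE)
  g / rowSums(g)
}

# Logistic normal draws: softmax of a multivariate normal, with the last
# topic fixed at 0 as the reference category (hypothetical helper)
rlogisticnormal_sketch <- function(n, mu, Sigma) {
  d <- length(mu)
  z <- matrix(rnorm(n * d), nrow = n)
  eta <- sweep(z %*% chol(Sigma), 2, mu, "+")  # rows ~ N(mu, Sigma)
  eta <- cbind(eta, 0)
  exp(eta) / rowSums(exp(eta))
}

theta_dir <- rdirichlet_sketch(n, alpha = c(1, 1, 1))

# Assumed covariance: topics 1 and 2 tend to rise and fall together
Sigma <- 0.5 * matrix(c(1, 0.9, 0.9, 1), nrow = 2)
theta_ln <- rlogisticnormal_sketch(n, mu = c(0, 0), Sigma = Sigma)

cor_dir <- cor(theta_dir[, 1], theta_dir[, 2])  # negative by construction
cor_ln  <- cor(theta_ln[, 1], theta_ln[, 2])    # governed by Sigma

print(c(dirichlet = cor_dir, logistic_normal = cor_ln))
```

This only illustrates the prior. As noted above, with longer documents the data dominate and the choice of prior matters less.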

I think people who prefer STM largely choose it because the initialization is quite good (so you get good-looking answers), there are tons of tools to help with visualization and the like, and things are quite easy. In that sense the feature set is more robust than Mallet's. You certainly could use other functions to do the post-estimation for STM.
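As a rough sketch of what "other functions" for post-estimation could look like, the following base-R example regresses one topic's proportion on user covariates. To keep it runnable without fitting a model, `theta` is simulated here as a stand-in for the document-topic matrix of a fitted model (`fit$theta` in an stm object). The covariates, the effect sizes, and the variable names are all made up for illustration.

```r
set.seed(42)
n <- 500  # number of documents (arbitrary)

# Hypothetical metadata, one row per document
meta <- data.frame(
  gender = factor(sample(c("f", "m"), n, replace = TRUE)),
  age    = round(runif(n, 18, 70))
)

# Stand-in for fit$theta: an n x 3 matrix of topic proportions.
# Topic 1 is simulated to be more prevalent for older users.
eta1  <- -1 + 0.03 * (meta$age - 40) + rnorm(n, sd = 0.5)
eta   <- cbind(eta1, rnorm(n, sd = 0.5), 0)
theta <- exp(eta) / rowSums(exp(eta))

# Second step of the two-step approach: topic proportion as the DV
dat <- cbind(meta, topic1 = theta[, 1])

# Linear regression on the point estimates of theta
fit_lm <- lm(topic1 ~ gender + age, data = dat)

# Fractional logit; quasibinomial avoids the non-integer warnings that
# family = binomial produces when the response is a proportion
fit_frac <- glm(topic1 ~ gender + age, family = quasibinomial, data = dat)

print(coef(summary(fit_lm)))
print(coef(summary(fit_frac)))
```

One design difference to keep in mind: these regressions treat the estimated proportions as if they were observed data. estimateEffect(), with its default uncertainty = "Global" setting, instead draws from the approximate posterior of the topic proportions and so carries that estimation uncertainty into the reported standard errors.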

Re: audience, I think you can just tell them that STM is quite like LDA. Also, depending on your field, more people may be familiar with it than you think; it's been used in a lot of applications at this point.

Thank you @bstewart.

I am using only prevalence covariates. As I understand from your note, they will be useful as my documents are very short (each document is a single tweet).

Two follow-up questions:

  1. Since I am using each tweet as a document, what is the best way to control for the nested nature of tweets within their Twitter handles? Will adding twitter_handle as a prevalence covariate be enough (e.g., stm(prevalence = ~ twitter_handle, ...))? If not, is there a way to specify the random effect/nested structure as a prevalence covariate?

  2. Is it fine to use polynomial terms as prevalence covariates?