bstewart/stm

Bug? Number of topics obtained with Lee and Mimno (2014)

santoshbs opened this issue · 2 comments

When the same corpus is fit with STM on two different machines (both Linux) using K = 0, which triggers the algorithm of Lee and Mimno (2014), stm() returns 73 topics on one machine and 58 topics on the other. The discrepancy persisted on a second attempt.

Is this a bug? How do we know which one is more/less correct?

The only difference is that we run the code in RStudio on one machine and in the IntelliJ R IDE on the other. I would be surprised if that had anything to do with the different numbers of topics obtained.

Interesting. Is this true even with the same seed set?

I don't think one is more or less correct. The setup for the Lee and Mimno algorithm involves a ton of approximations, the biggest of which is projecting the vocabulary from a likely ~10K-dimensional space to a 3-dimensional space. So some information will definitely be lost!

It is definitely a stochastic algorithm, though, as the projection itself is stochastic. I would think of it less as finding 'the right' number of topics and more as an interesting heuristic for setting the number of topics.
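To make the point about the stochastic projection concrete, here is a toy sketch in Python. It is not the stm implementation. It assumes a plain Gaussian random projection (stm's own projection step may differ), projects to 2 dimensions rather than 3 to keep the hull code short, and uses synthetic data. All function names are hypothetical. The number of convex hull vertices after projection stands in for the number of topics.

```python
import random


def random_projection(points, out_dim, rng):
    """Project points to out_dim dimensions with a Gaussian random matrix.

    Assumption: a stand-in for the projection step, not what stm does.
    """
    in_dim = len(points[0])
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return [tuple(sum(m * x for m, x in zip(row, p)) for row in matrix)
            for p in points]


def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the 2D convex hull vertices (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def hull_vertex_count(points, seed):
    """Count hull vertices after a seeded random projection to 2D."""
    rng = random.Random(seed)
    return len(convex_hull(random_projection(points, 2, rng)))


# Synthetic "vocabulary": 300 points in a 50-dimensional space.
data_rng = random.Random(0)
points = [[data_rng.random() for _ in range(50)] for _ in range(300)]

# The data are fixed; only the projection seed changes between runs.
counts = {seed: hull_vertex_count(points, seed) for seed in range(5)}
print(counts)
```

In this toy, a fixed seed reproduces the same count on repeated calls, while different seeds can give different counts for identical data, because each projection keeps different information from the original space.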

@bstewart - Many thanks for the response!

Yes, I am using the same seed on both Linux machines. I guess even if it is meant for heuristic purposes, getting 58 topics on one machine and 73 on the other does not help.