Bug? Number of topics obtained with Lee and Mimno (2014)
santoshbs opened this issue · 2 comments
When the same corpus is fit with STM on two different machines (both Linux) using K = 0
(which selects the number of topics with the algorithm in Lee and Mimno, 2014), stm()
returns 73 topics on one machine and 58 topics on the other. The discrepancy persisted on a second attempt.
Is this a bug? How do we know which one is more/less correct?
The only difference is that we run the code in RStudio on one machine and in the IntelliJ R IDE on the other. I would be surprised if this had anything to do with the different number of topics obtained.
Interesting. Is this true even with the same seed set?
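A minimal sketch of how one might check this, assuming the `gadarian` example data bundled with stm stands in for the real corpus and that the optional packages needed for K = 0 are installed. The helper `fit_k0` and the seed values are made up for illustration. The same seed should give the same K on a single machine; whether it also does across machines may depend on package and numerical library versions, which this sketch does not test.

```r
library(stm)

# Example corpus shipped with stm; substitute your own documents here.
processed <- textProcessor(gadarian$open.ended.response,
                           metadata = gadarian, verbose = FALSE)
prepped <- prepDocuments(processed$documents, processed$vocab,
                         processed$meta, verbose = FALSE)

# Hypothetical helper: fit with K = 0 and an explicit seed.
# K = 0 requires the spectral initialization; only a few EM iterations
# are run because we only want to read off the selected number of topics.
fit_k0 <- function(seed) {
  stm(prepped$documents, prepped$vocab, K = 0,
      init.type = "Spectral", seed = seed,
      max.em.its = 5, verbose = FALSE)
}

k_a <- fit_k0(1234)$settings$dim$K  # first run, seed 1234
k_b <- fit_k0(1234)$settings$dim$K  # second run, same seed
k_c <- fit_k0(9999)$settings$dim$K  # different seed, K may differ

print(c(same_seed_1 = k_a, same_seed_2 = k_b, other_seed = k_c))
```

Running the same script on both machines and comparing the printed values would show whether the seed alone accounts for the 73 versus 58 difference.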
I don't think one is more or less correct. The setup for the Lee and Mimno algorithm involves a ton of approximations, the biggest of which is projecting the vocabulary from a likely ~10K-dimensional space to a 3-dimensional space. So some information will definitely be lost!
It is definitely a stochastic algorithm, though, as the projection itself is stochastic. I wouldn't think of it as finding 'the right' number of topics, but rather as an interesting heuristic for setting the number of topics.
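To illustrate why a stochastic projection can change the count, here is a toy sketch in base R. It is not the stm implementation: it swaps in a plain Gaussian random projection to 2 dimensions (rather than the projection to 3 dimensions described above), uses made-up data, and counts the vertices of the resulting convex hull with `chull`. The function name `count_hull_vertices` is hypothetical.

```r
# Toy stand-in, NOT what stm does internally: project points with a random
# Gaussian matrix and count how many land on the 2D convex hull.
count_hull_vertices <- function(x, seed) {
  set.seed(seed)
  proj <- matrix(rnorm(ncol(x) * 2), ncol = 2)  # random 2D projection
  low <- x %*% proj                             # projected points
  length(chull(low))                            # hull vertex count
}

set.seed(1)
x <- matrix(rnorm(500 * 50), nrow = 500)  # 500 fake "words" in 50 dimensions

same_1 <- count_hull_vertices(x, seed = 42)
same_2 <- count_hull_vertices(x, seed = 42)  # same seed, same projection
other  <- sapply(1:10, function(s) count_hull_vertices(x, seed = s))

print(c(same_1, same_2))  # identical, because the seed fixes the projection
print(other)              # counts for ten different seeds
```

With a fixed seed the count is reproducible, while different seeds will typically give different counts for the same data, which is the sense in which the number of topics is a heuristic rather than a single right answer.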