eigenfoo/old-eigenfoo.github.io

Why Latent Dirichlet Allocation Sucks | Eigenfoo

utterances-bot opened this issue · 2 comments

Latent Dirichlet allocation is a well-known and popular model in machine learning and natural language processing, but it really sucks sometimes. Here’s why.

https://eigenfoo.xyz/lda-sucks/

Hey nice read!

But how do you perform topic modeling on short sentences? When I tried LDA on survey data, all of the one-line comments were grouped into a single topic instead of being classified into their appropriate topics. Could you please suggest a solution or another algorithm to overcome this?

Thanks in advance!

@Rashmirashu3 that's a difficult question! Do the other clusters look reasonable to you? If that's the case, have you considered treating the last topic as an "other" topic, and simply ignoring it in downstream analyses? In any case, here are some thoughts.

  1. I think the most straightforward thing to do would be "make your sentences longer": whether that means stitching several documents together (e.g. from the same writer, or from the same forum), or doing some clever feature engineering on the text (e.g. using some domain-specific text preprocessing).

  2. A related suggestion would be to think hard about the data you have at hand: why do you expect there to even be topics to model? Perhaps the metadata is more informative than the text itself: can you leverage that in any way?

  3. Finally, if you're feeling brave, there's a lot of research out there on how to handle topic modelling on short documents. This StackExchange answer and this Quora answer should offer a good start. I'd be interested to hear what you discover!
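To make suggestion 1 concrete, here's a minimal sketch of pooling short comments into longer pseudo-documents before fitting LDA with scikit-learn. The grouping key (the writer) and the toy comments are hypothetical stand-ins for real survey data; in practice you'd group by whatever metadata you have (respondent, thread, date, etc.):

```python
# Sketch: stitch short one-liners into longer pseudo-documents, then fit LDA.
# The writers and comments below are made-up illustrative data.
from collections import defaultdict

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

comments = [
    ("alice", "great product love it"),
    ("alice", "shipping was fast"),
    ("bob", "terrible support experience"),
    ("bob", "refund took weeks"),
]

# Pool each writer's one-liners into a single longer document.
pooled = defaultdict(list)
for writer, text in comments:
    pooled[writer].append(text)
docs = [" ".join(texts) for texts in pooled.values()]

# Fit LDA on the pooled documents instead of the raw one-liners.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_dist = lda.fit_transform(counts)  # one topic distribution per pooled doc
```

The same idea works with gensim or any other LDA implementation: the point is that the unit you feed the model should be long enough to give reliable word co-occurrence statistics, even if that unit is a concatenation of several short responses.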