Document-topic distribution

Question


devanshrj opened this issue a year ago · 2 comments

Hi, thank you for this awesome work!

I would like to use KDTM to generate topics and document-topic distribution on a corpus containing 4.4M tweets (each tweet can be considered a document). Can you let me know how I can obtain the document-topic distribution? The closest method I can find to this is save_document_representations(), but I am not sure if it's the same thing.

Also, my dataset does not have any labels, so I wanted to know if labels are a part of the training process or if they are optional.

Thanks in advance!

Answer 1 · 2023-05-09T22:16:55.000Z

Thanks for your interest!

To your first question: save_document_representations() does give you document-topic distributions, but each one is just a single sample. For a later paper, we modified the function to sample multiple times and take the mean (if I recall correctly, there's no analytical mean for a logistic-normal). You can see the modified code in this branch. In fact, if my commit history is to be trusted, you can view the exact changes here.
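For readers who want the idea without opening the branch, here is a minimal sketch of the "sample several times and average" approach. It is not the repository's code: the function name `logistic_normal_mean` and the inputs `mu` and `log_var` (assumed to be the per-document mean and log-variance of a diagonal Gaussian produced by the encoder) are hypothetical names chosen for illustration.

```python
import numpy as np

def logistic_normal_mean(mu, log_var, n_samples=100, seed=0):
    """Monte Carlo estimate of E[softmax(z)] for z ~ N(mu, diag(exp(log_var))).

    mu, log_var: arrays of shape (n_docs, n_topics). These are assumed
    encoder outputs, not names taken from the repository.
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    std = np.exp(0.5 * np.asarray(log_var, dtype=float))

    # Draw n_samples Gaussian samples per document:
    # shape (n_samples, n_docs, n_topics).
    z = mu + std * rng.standard_normal((n_samples,) + mu.shape)

    # Numerically stable softmax over the topic axis.
    z -= z.max(axis=-1, keepdims=True)
    theta = np.exp(z)
    theta /= theta.sum(axis=-1, keepdims=True)

    # Average the sampled distributions; each row still sums to 1.
    return theta.mean(axis=0)

# Toy example: two documents, three topics, unit variance.
mu = np.array([[2.0, 0.0, -1.0], [0.0, 0.0, 0.0]])
log_var = np.zeros_like(mu)
theta_mean = logistic_normal_mean(mu, log_var, n_samples=500)
```

For a corpus of millions of tweets, the same averaging would presumably be applied batch by batch so that the sampled array fits in memory.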

Labels (as well as covariates) are optional and all reported results are unsupervised.

Not that you asked, but you should also note that we realized the NPMI implementation in this repo (ported from the original Scholar paper) is nonstandard, and I believe we calculate it during training. You should prefer implementations from Gensim, OCTIS, Palmetto, or us. Of course, the best bet is to forgo automated metrics altogether 😉
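For intuition only, here is a small sketch of the usual NPMI formula, PMI divided by -log of the joint probability, averaged over word pairs. It is an illustration under stated assumptions, not a replacement for the libraries named above: it estimates probabilities from document-level co-occurrence with no smoothing, whereas those libraries may use sliding windows or other conventions. The function name `npmi_coherence` and the handling of edge cases are assumptions made for this example.

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, documents):
    """Mean pairwise NPMI of topic_words from document co-occurrence counts.

    documents: list of tokenized documents (lists of strings).
    """
    if len(topic_words) < 2:
        raise ValueError("need at least two topic words")

    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            # Assumed convention: words that never co-occur score -1.
            scores.append(-1.0)
        elif p12 == 1:
            # The formula is 0/0 here; treated as 0 in this sketch (an assumption).
            scores.append(0.0)
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

# Toy corpus: "cat" and "dog" always appear together, "fish" never with "cat".
docs = [["cat", "dog"], ["cat", "dog"], ["fish"], ["bird"]]
together = npmi_coherence(["cat", "dog"], docs)
apart = npmi_coherence(["cat", "fish"], docs)
```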

Answer 2 · 2023-05-09T22:21:23.000Z

Another thing you didn't ask: we've found that MALLET works surprisingly well with tweets, in case you haven't tried it already and are looking for a good baseline.