Does contextual information always increase coherence?
PearlSikka opened this issue · 4 comments
Hi, I've been working on topic models for tweets. I trained LDA, CTM, and ProdLDA models on my corpus. However, the coherence score for LDA is always higher across different numbers of topics. I was wondering why that would be. Are there any specific cases where LDA might perform better than CTM and ProdLDA? I've read through the papers for both CTM and ProdLDA, and it looks like CTM performed better on several datasets, but that isn't the case for my use case. I would really appreciate some help/examples on why contextual information might not help.
Thanks
Hello,
I think I would need some more details. Can you please answer these questions?
- Which coherence measure are you using? And for how many top words?
- Which kind of preprocessing did you apply to the tweets?
- What are the characteristics of the dataset? (How many documents, the average document length, etc.)
- Which pre-trained language model did you use to generate the embeddings?
- Did you use any particular hyperparameter configuration for the models?
In general, I would say that CTM should perform better than LDA and ProdLDA on tweets, because tweets are usually composed of only a few words and the pre-trained embeddings can really help the topic model find similar topical document representations. But I think I can answer your question better if I have more context :)
Thanks
Hi Silvia,
Apologies for late reply.
I am using the NPMI coherence measure. I also tried the UMass coherence measure.
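In case it matters, this is roughly how I compute the scores (a minimal sketch with gensim; `tokenized_tweets` and `topics` are placeholders for my preprocessed corpus and the per-topic top-word lists):

```python
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# tokenized_tweets: list of token lists; topics: list of top-word lists per topic (placeholders)
dictionary = Dictionary(tokenized_tweets)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_tweets]

npmi = CoherenceModel(topics=topics, texts=tokenized_tweets,
                      dictionary=dictionary, coherence='c_npmi', topn=10)
umass = CoherenceModel(topics=topics, corpus=bow_corpus,
                       dictionary=dictionary, coherence='u_mass', topn=10)
print(npmi.get_coherence(), umass.get_coherence())
```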
For preprocessing, I removed all punctuation, URLs, contractions, hashtags, and mentions. Stop words and emoticons were also removed to improve the quality of the generated topics. The four keywords “corona”, “Wuhan”, “nCov”, and “covid” were filtered out to reduce noise in topic inference. The tweets were split into tokens, which were then lemmatized to preserve the meaning of the words.
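Roughly, the cleaning step looks like this (a simplified sketch of my pipeline using NLTK; contraction handling is omitted and the regexes are just the ones I happen to use):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')  # first run only
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
noise_words = {'corona', 'wuhan', 'ncov', 'covid'}

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r'https?://\S+', ' ', text)   # URLs
    text = re.sub(r'[@#]\w+', ' ', text)        # mentions and hashtags
    text = re.sub(r'[^a-z\s]', ' ', text)       # punctuation, digits, emoticons
    tokens = [lemmatizer.lemmatize(tok) for tok in text.split()]
    return [t for t in tokens if t not in stop_words and t not in noise_words]
```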
There are about 3300 tweets. I've attached the corpus and vocabulary for your reference.
I ran CTM after hyperparameter tuning with
CTM(num_topics=n_top, inference_type="combined", activation='softplus', num_epochs=30, bert_model='bert-base-nli-mean-tokens', use_partitions=False, num_layers=2, num_neurons=85, dropout=0.65)
and LDA with
LDA(num_topics=n_top, iterations=30, random_state=100, chunksize=200, passes=10)
I didn't use partitioning for training.
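For completeness, this is roughly how I train and score the models in OCTIS (a sketch; the dataset folder path and the number of topics are placeholders, and the LDA model is trained and scored the same way):

```python
from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM
from octis.evaluation_metrics.coherence_metrics import Coherence

# Placeholder path to a folder with the attached corpus and vocabulary in OCTIS format
dataset = Dataset()
dataset.load_custom_dataset_from_folder("path/to/tweet_dataset")

n_top = 10  # placeholder number of topics
model = CTM(num_topics=n_top, inference_type="combined", activation='softplus',
            num_epochs=30, bert_model='bert-base-nli-mean-tokens',
            use_partitions=False, num_layers=2, num_neurons=85, dropout=0.65)
output = model.train_model(dataset)

npmi = Coherence(texts=dataset.get_corpus(), topk=10, measure='c_npmi')
print(npmi.score(output))
```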
I know I might be asking for too much help but this has kept me confused for a while now :)
Thanks a lot
vocabulary.txt
corpus.zip
Hello PearlSikka,
The CTM implemented in OCTIS is slightly different, because OCTIS uses the preprocessed text to generate the contextualized embeddings, so we may lose some information for this reason. My suggestion would be to try using the CTM library directly (https://github.com/MilaNLProc/contextualized-topic-models).
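If it helps, here is a minimal sketch of how the library is typically used, roughly following its README (`raw_tweets` and `cleaned_tweets` are placeholders for your unpreprocessed and preprocessed documents, 768 is the embedding size of the BERT model you are using, and the number of topics is a placeholder):

```python
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

# raw_tweets: unpreprocessed texts for the embeddings; cleaned_tweets: preprocessed texts for the BoW
tp = TopicModelDataPreparation("bert-base-nli-mean-tokens")
training_dataset = tp.fit(text_for_contextual=raw_tweets, text_for_bow=cleaned_tweets)

# n_components: number of topics (placeholder)
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=50)
ctm.fit(training_dataset)
print(ctm.get_topic_lists(10))
```

This way the embeddings are computed on the original tweets, while the bag-of-words still uses your preprocessed text.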
Another aspect we noticed is that using a better contextualized model results in more coherent topics. For example, we compared BERT (the model you are using) with RoBERTa (stsb-roberta-large), and RoBERTa worked better.
Finally, the vocabulary size might have an impact as well. You could try to reduce it by removing the least frequent words.
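For example, with gensim you could do something like this (the thresholds are just placeholders to tune on your corpus):

```python
from gensim.corpora import Dictionary

# tokenized_tweets: list of token lists from your preprocessing (placeholder)
dictionary = Dictionary(tokenized_tweets)
# Drop words appearing in fewer than 5 tweets or in more than half of them
dictionary.filter_extremes(no_below=5, no_above=0.5)
vocabulary = list(dictionary.token2id.keys())
```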
Thanks for your patience. You may have already solved your issues, but I hope this helps anyway :)
Silvia
Thank you so much Silvia.