Evaluating fit on new data?

Is there a way to assess how well the first model 'fits' the second dataset? My intuition is that coherence wouldn't really be appropriate, but maybe I'm wrong? Could I look at the predictions on the new data and see how many fall into the -1 topic category? Or I could create a topic model from the smaller dataset (it isn't too small), then compute the cosine similarity between each of the small-data model's topics and the larger model's topics, to see whether the small dataset contains topics with no counterpart in the larger model (a rough version of this is sketched below)? Am I overthinking it? Thanks for any advice.
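For the cosine-similarity idea, here's a minimal sketch (not an official BERTopic recipe, just an illustration). It assumes `model_small` and `model_large` are placeholder names for two fitted BERTopic models that were built with the same embedding model, so their `topic_embeddings_` live in a comparable space:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Topic embeddings of each fitted model. Note: if outliers exist, the -1
# topic's embedding is typically the first row.
emb_small = np.asarray(model_small.topic_embeddings_)
emb_large = np.asarray(model_large.topic_embeddings_)

# Pairwise similarities between every small-model and large-model topic.
sims = cosine_similarity(emb_small, emb_large)

# For each small-data topic, its closest large-model topic; a low best-match
# score suggests a topic with no clear counterpart in the larger model.
for i, score in enumerate(sims.max(axis=1)):
    print(f"small topic {i}: best match similarity {score:.3f}")
```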

What essentially happens when you run the pre-trained model on unseen data is that the model simply assigns the documents to the clusters it previously created, so additional evaluation would mean labeling the unseen data using those existing clusters (see the sketch below).
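For illustration, a minimal sketch of that assignment step, assuming `topic_model` is the fitted model and `new_docs` is a list of unseen documents (both placeholder names); the share of documents landing in the -1 outlier topic is then easy to read off:

```python
# Assign the unseen documents to the existing topics.
topics, probs = topic_model.transform(new_docs)

# Fraction of unseen documents that fall into the -1 outlier topic.
outlier_share = sum(t == -1 for t in topics) / len(topics)
print(f"{outlier_share:.1%} of the new documents were assigned to topic -1")
```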

Although it is an interesting experiment, you would first have to define exactly what you are measuring, as this is still an unsupervised task.

> My intuition is that coherence wouldn't really be appropriate, but maybe I'm wrong?

Re-evaluating coherence is not really possible here since the topic representations do not change. You are merely assigning unseen documents to existing topics.