MaartenGr/BERTopic

Best-performing embedding models?

Opened this issue · 2 comments

I've been looking for up-to-date information about how various pre-trained models compare for clustering and topic modeling with BERTopic – rather than semantic search, which is all the rage these days with RAG pipelines.

According to the official pre-trained model evaluations, all-mpnet-base-v2 is best overall, while sentence-t5-xxl is best for sentence similarity. However, both of these models are quite old. Surely there are better pre-trained models available for similarity/clustering?

Looking at the MTEB leaderboard, mxbai-embed-large-v1 appears to be the leading open-weights model currently. Should I expect this model to be superior to all-mpnet-base-v2 or sentence-t5-xxl for BERTopic? I've done some informal tests, but I'm not convinced it results in better topics.

I would indeed advise looking at the MTEB leaderboard, and specifically at the clustering metric, since that is closest to what BERTopic does. In my experience, the clusters are formed a bit better when using a model that scores higher on the leaderboard.

However, do note that small differences in clusters might not affect the topic representations that much if you have a relatively large dataset. You may see differences in smaller clusters, but they are unlikely to affect the larger clusters that already have good representations.

@raphael-milliere, did you find anything?