MaartenGr/BERTopic

hierarchical_topics() produces incorrect output when three topics have the same distance

Opened this issue · 4 comments

Hi there,

I have noticed that the hierarchical_topics(...) method produces incorrect results when three or more topics have the same (tf-idf) distances. Let me illustrate it with an example.

from umap import UMAP
from bertopic import BERTopic

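# several of these document groups share no vocabulary, so multiple
# topic pairs end up at exactly the same distance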
docs = (
    ["banana"] * 300 
    + ["banana apple"] * 300 
    + ["pear"] * 300 
    + ["lemon"] * 300 
    + ["clock"] * 300 
)

model = BERTopic(umap_model=UMAP(random_state=42))
topics, probs = model.fit_transform(docs)
hr = model.hierarchical_topics(docs)
hr

This outputs
[screenshot of the hierarchical_topics output dataframe]

The cluster with Parent_ID == 8 includes topics [1, 2, 3], but topic 3 does not appear in the left child, the right child, or any of their children.

Why is this happening?
The flat cluster structure is recomputed in each iteration, and there is no guarantee that a newly formed cluster contains only two topics, while the code that follows assumes exactly two.
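To make the mechanism concrete, here is a scipy-only sketch (not BERTopic's actual code, just an illustration of what I believe happens): when distances tie, cutting the linkage at that distance merges all of the tied topics in a single step, so there is never a flat clustering in which only two of them are grouped.

import numpy as np
from scipy.cluster import hierarchy as sch
from scipy.spatial.distance import squareform

# Hypothetical symmetric distance matrix for four topics; topics 1, 2 and 3
# are all at the same distance (0.5) from one another.
D = np.array([
    [0.0, 0.9, 0.9, 0.9],
    [0.9, 0.0, 0.5, 0.5],
    [0.9, 0.5, 0.0, 0.5],
    [0.9, 0.5, 0.5, 0.0],
])
Z = sch.linkage(squareform(D), method="average")

# At the tied merge distance (0.5) the flat clustering jumps straight from
# four singletons to a cluster containing topics 1, 2 and 3 at once.
for threshold in sorted(set(Z[:, 2])):
    print(threshold, sch.fcluster(Z, t=threshold, criterion="distance"))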

What should be the expected behavior?
I think a new intermediate cluster should emerge. Essentially, the structure should look like this:

Parent_ID  Child_Left_ID  Child_Right_ID
        8              1              11
       11              2               3

Thank you for the reproducible example! This does indeed seem to be an issue. Interestingly, I would expect this to happen rarely, since distances are seldom exactly the same, right? How did you stumble upon this issue?

Either way, seeing as you have already dived into the code, do you have any suggestions on how this could be resolved easily? Perhaps by checking for similar distances beforehand?

I encountered the problem when I had small clusters, none of which shared any words, so their pairwise distances ended up identical. I think this could occur from time to time.
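For a quick illustration of why that produces ties (assuming topics are compared via cosine similarity of their c-TF-IDF vectors, which I believe is the default distance here): topics with disjoint vocabularies all sit at exactly the same distance from one another.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical c-TF-IDF rows for three topics over the vocabulary
# ["pear", "lemon", "clock"]; no topic shares a word with another.
ctfidf = np.array([
    [0.7, 0.0, 0.0],   # "pear" topic
    [0.0, 0.8, 0.0],   # "lemon" topic
    [0.0, 0.0, 0.6],   # "clock" topic
])

# All off-diagonal similarities are exactly 0, so every pairwise distance
# between these topics is exactly 1 -- a three-way tie.
print(cosine_similarity(ctfidf))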

I think checking the distances beforehand is a good idea, at least to give a warning to start with. A possible fix could be to (deterministically) select a topic that has the same distance to two other topics and add a tiny amount to that distance. Ideally, this should be done only for the calculation of the hierarchy, not when the distances are stored in the output dataframe. What do you think @MaartenGr?

A possible fix could be to (deterministically) select a topic that has the same distance to two other topics and add a tiny amount to that distance.

If it's small enough not to be significant, this would be the easiest solution without needing to change much of the source code. We could also add some very slight (perhaps uniform) noise to the entire distance matrix so that all distances are affected equally. Either way, this sounds like a good approach as long as we keep the noise small and sensible.
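Roughly along these lines, just as an untested sketch (the function name, noise scale, and linkage method here are illustrative, not the actual hierarchical_topics internals):

import numpy as np
from scipy.cluster import hierarchy as sch
from scipy.spatial.distance import squareform

def linkage_with_jitter(distance_matrix, scale=1e-10, seed=42):
    """Compute a linkage on a jittered copy of the distance matrix.

    The noise is tiny and seeded, so ties are broken deterministically,
    while the original (un-jittered) distances can still be stored in
    the output dataframe.
    """
    rng = np.random.default_rng(seed)
    noise = rng.uniform(0, scale, size=distance_matrix.shape)
    noise = (noise + noise.T) / 2        # keep the matrix symmetric
    np.fill_diagonal(noise, 0.0)         # keep zeros on the diagonal
    jittered = distance_matrix + noise
    return sch.linkage(squareform(jittered, checks=False), method="average")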

If you have the time, a PR would be appreciated! If not, I'll put it on my backlog, but I'm not sure I'll be able to tackle it in the coming months.

I like the approach that adds the noise uniformly! Let me work it out. @MaartenGr, it seems I don't have sufficient permissions to assign the issue to myself. Could you please help me out with this?