MaartenGr/BERTopic

Assigning documents to multiple topics using zero-shot topic modeling

Opened this issue · 1 comment

Goal

I am interested in fitting a BERTopic model using zero-shot topic modeling. I want it to be possible for documents to be assigned to more than one of my suggested topics. I have patched several BERTopic functions to enable this but wanted to get the author's opinion on correctness or alternatives.

The current implementation assigns documents to at most one suggested topic based on a specified cosine similarity threshold during model construction. If the threshold is met for a specific document, it is assigned to the topic with which it has the highest similarity.
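For concreteness, the current single-assignment behavior can be sketched like this (the similarity values and threshold below are made up):

```python
import numpy as np

# Sketch of the current single-assignment behavior (hypothetical data):
# each row of `sim` is a document's cosine similarity to each suggested topic.
sim = np.array([
    [0.85, 0.40, 0.10],   # doc 0: clear match with topic 0
    [0.30, 0.25, 0.20],   # doc 1: below threshold everywhere -> left for clustering
    [0.70, 0.75, 0.15],   # doc 2: two candidates, but only the best is kept
])
threshold = 0.5

best = sim.argmax(axis=1)                  # highest-similarity topic per document
assigned = sim.max(axis=1) >= threshold    # only assign if it clears the threshold
single_topics = [int(t) if ok else None for t, ok in zip(best, assigned)]
print(single_topics)  # [0, None, 1]
```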

My Approach

My first change is to the _zeroshot_topic_modeling function. I calculate which topics each document matches, i.e., where the cosine similarity exceeds the specified threshold. Next, if a document has more than one match, additional copies of that document (and its embedding) are made as necessary, keeping the copies adjacent in the list of documents. Because this function does not have access to my documents outside of BERTopic, I set an instance variable that provides enough information to make copies of my documents and embeddings as necessary.
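The duplication step can be sketched as follows (hypothetical similarity matrix and documents; `expanded_ids` stands in for the instance-variable bookkeeping that tracks which original document each copy came from):

```python
import numpy as np

# Sketch of the duplication idea (hypothetical data): a document that clears the
# threshold for several suggested topics is copied once per matching topic,
# with copies kept adjacent, and an ID list records each copy's original document.
sim = np.array([
    [0.85, 0.40, 0.10],
    [0.30, 0.25, 0.20],
    [0.70, 0.75, 0.15],
])
threshold = 0.5
docs = ["doc a", "doc b", "doc c"]

expanded_docs, expanded_ids, expanded_topics = [], [], []
for i, row in enumerate(sim):
    matches = np.flatnonzero(row >= threshold)
    if len(matches) == 0:
        # no zero-shot match: keep a single copy for regular clustering
        expanded_docs.append(docs[i])
        expanded_ids.append(i)
        expanded_topics.append(None)
    else:
        for t in matches:  # one adjacent copy per matching zero-shot topic
            expanded_docs.append(docs[i])
            expanded_ids.append(i)
            expanded_topics.append(int(t))

print(expanded_ids)     # [0, 1, 2, 2]
print(expanded_topics)  # [0, None, 0, 1]
```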

In the _combine_zeroshot_topics step, there is an occasional issue where the merged model's topics are set as an np.ndarray rather than a list, which causes problems later on. Fixing this is my second patch.
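A minimal sketch of that type fix (the variable name `merged_topics` is illustrative):

```python
import numpy as np

# Sketch of the type fix (hypothetical): downstream code expects a plain list,
# so coerce the merged topics if they come back as an ndarray.
merged_topics = np.array([0, 0, 1, 2])
if isinstance(merged_topics, np.ndarray):
    merged_topics = merged_topics.tolist()
print(type(merged_topics).__name__, merged_topics)  # list [0, 0, 1, 2]
```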

The next step is to reduce_topics. Here, multiple zero-shot topics may essentially be merged, causing the new topics to sometimes have duplicated documents. I have patched _reduce_to_n_topics just before documents.Topic = new_topics to remove duplicate documents within a topic based on external document IDs I provide via an instance variable. I keep only unique (topic_id, external_document_ID) pairs. That instance variable with external document IDs is updated to a (possibly) reduced list of IDs, which I use outside of BERTopic to update my list of documents and embeddings.

The next step is reduce_outliers, where documents is my expanded list of documents (with potential duplicates) after fitting. I do not believe there is any risk here from duplicated documents, because any outliers that get reclassified only had one copy anyway.

The last step is update_topics using the updated documents list and topic IDs from the reduce_outliers step. Because there is no reorganization of topics, I believe there is no risk here from duplicated documents.

After all this, I have postprocessing to determine the list of topics for each of my original documents.

Questions

Keeping in mind my original goal, are there any apparent flaws in this approach, or suggestions for improvement? I understand there are some other methods out there related to multiple topics per document, such as Topic Distributions or using the probabilities that are returned by transforming my documents after all my steps, but I have not had much luck getting any sort of useful distribution, and a probability matrix is only returned on fit.

One alternative I thought of is to update topic_model.topics_ after fitting based on a threshold and the probabilities, update my documents and embeddings accordingly, and then keep the reduce_topics patch to avoid duplicates. This would have the benefit of multiple topics for a document not just for the suggested topics but also for ones that came for clustering. A downside is an additional threshold to specify.
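This alternative can be sketched as follows (the probability matrix and second threshold below are made up):

```python
import numpy as np

# Sketch of the alternative (hypothetical data): treat the fitted model's
# probability matrix (one row per document, one column per topic) as the
# multi-assignment signal, gated by a second user-chosen threshold.
probs = np.array([
    [0.60, 0.35, 0.05],
    [0.10, 0.55, 0.35],
    [0.45, 0.45, 0.10],
])
prob_threshold = 0.4

# Every topic (zero-shot or clustered) above the threshold is kept.
multi_topics = [np.flatnonzero(row >= prob_threshold).tolist() for row in probs]
print(multi_topics)  # [[0], [1], [0, 1]]
```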

Thoughts?

A very interesting approach to the problem! Thanks for sharing such an extensive description of the process.

Keeping in mind my original goal, are there any apparent flaws in this approach or suggestions for improvement? I understand there are some other methods (#814) out there related to multiple topics per document, such as Topic Distributions or using the probabilities that are returned by transforming my documents after all my steps, but I have not had much luck getting any sort of useful distribution, and a probability matrix is only returned on fit.

Interestingly, I think this might be solved a little easier than your implementation. If I am not mistaken, what you are essentially doing is running cosine similarity between the document and the zero-shot topics and assigning a single document to multiple zero-shot topics if it exceeds a certain threshold.

Although your approach seems valid, it might be a bit easier if you would look at BERTopic's .fit and .transform as two separate processes:

  • .fit is mainly used to derive the topic representations. It is meant to create reasonable (whatever that means) representations of the topics, and its main outputs are those representations, such as labels and words.
  • .transform, in contrast, is used to create the topic assignments where the documents are actually assigned to their respective topics.

As long as you are happy with the topic representations during .fit, regardless of whether the documents are correctly assigned to one or more topics, there is no need to go through your process. Instead, you can focus on the topic assignment, primarily by using a method like .approximate_distribution, which you mentioned does not give you a useful distribution.

My first question would of course be: why? Why isn't it useful to you, given that it does return a probability matrix of sorts that you can use with a user-specified threshold?
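For illustration, thresholding such a matrix could look like this (the `topic_distr` values below are made up; the commented-out line shows where the real matrix would come from):

```python
import numpy as np

# Sketch (hypothetical values): .approximate_distribution returns a
# documents-by-topics matrix; thresholding each row yields multi-topic labels.
# topic_distr, _ = topic_model.approximate_distribution(docs)  # real call on a fitted model
topic_distr = np.array([
    [0.70, 0.20, 0.10],
    [0.05, 0.50, 0.45],
])
threshold = 0.3

assignments = [np.flatnonzero(row >= threshold).tolist() for row in topic_distr]
print(assignments)  # [[0], [1, 2]]
```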

Having said that, you could also simply save the model with safetensors serialization and then load it back. When you do, the underlying dimensionality reduction and clustering models are removed. Now, whenever you run .transform, it will use the cosine similarity between topic and document embeddings to generate the exact same similarity matrix that you have created manually. You can use that output to assign a single document to multiple topics with the same threshold you specified for zero-shot topic modeling.
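The effect of that trick can be sketched directly (hypothetical embeddings; after a safetensors save/load, .transform effectively reduces to this cosine-similarity computation):

```python
import numpy as np

# Sketch (hypothetical data): after something like
#   topic_model.save("my_model", serialization="safetensors")
#   loaded = BERTopic.load("my_model")
# .transform falls back to cosine similarity between document and topic
# embeddings; the same matrix can be computed and thresholded directly.
def cosine_sim(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

doc_embeddings   = np.array([[1.0, 0.0], [0.6, 0.8]])
topic_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])

sim = cosine_sim(doc_embeddings, topic_embeddings)
threshold = 0.5  # reuse the zero-shot threshold for multi-topic assignment
multi = [np.flatnonzero(row >= threshold).tolist() for row in sim]
print(multi)  # [[0], [0, 1]]
```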

The above is a bit of a hidden trick which I would like to make more visible. I hope in the coming months to have some time to create a variable in .transform that will allow you to select the method of prediction, for instance:

topics, probs = topic_model.transform(documents, method="embeddings")

Hope this helps!