MaartenGr/BERTopic

`IndexError: list index out of range` when using zeroshot_topic_list in 0.16.1

Opened this issue · 19 comments

Hi, I recently re-ran a notebook that uses zeroshot_topic_list and got IndexError: list index out of range.
I fixed this by downgrading to 0.16.0.

Full stacktrace:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[2], line 18
      9 vectorizer_model = CountVectorizer(stop_words="english")
     11 topic_model = BERTopic(
     12     min_topic_size=20,
     13     zeroshot_topic_list=zeroshot_topic_list,
     14     zeroshot_min_similarity=.25,
     15     vectorizer_model=vectorizer_model
     16 )
---> 18 topics, probs = topic_model.fit_transform(docs)
     19 topic_model.get_topic_info()

File /opt/conda/lib/python3.10/site-packages/bertopic/_bertopic.py:448, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    446 # Combine Zero-shot with outliers
    447 if self._is_zeroshot() and len(documents) != len(doc_ids):
--> 448     predictions = self._combine_zeroshot_topics(documents, assigned_documents, assigned_embeddings)
    450 return predictions, self.probabilities_

File /opt/conda/lib/python3.10/site-packages/bertopic/_bertopic.py:3682, in BERTopic._combine_zeroshot_topics(self, documents, assigned_documents, embeddings)
   3680 cluster_indices = list(documents.Old_ID.values)
   3681 cluster_names = list(merged_model.topic_labels_.values())[len(set(y)):]
-> 3682 cluster_topics = [cluster_names[topic + self._outliers] for topic in documents.Topic.values]
   3684 df = pd.DataFrame({
   3685     "Indices": zeroshot_indices + cluster_indices,
   3686     "Label": zeroshot_topics + cluster_topics}
   3687 ).sort_values("Indices")
   3688 reverse_topic_labels = dict((v, k) for k, v in merged_model.topic_labels_.items())

File /opt/conda/lib/python3.10/site-packages/bertopic/_bertopic.py:3682, in <listcomp>(.0)
   3680 cluster_indices = list(documents.Old_ID.values)
   3681 cluster_names = list(merged_model.topic_labels_.values())[len(set(y)):]
-> 3682 cluster_topics = [cluster_names[topic + self._outliers] for topic in documents.Topic.values]
   3684 df = pd.DataFrame({
   3685     "Indices": zeroshot_indices + cluster_indices,
   3686     "Label": zeroshot_topics + cluster_topics}
   3687 ).sort_values("Indices")
   3688 reverse_topic_labels = dict((v, k) for k, v in merged_model.topic_labels_.items())

Hmmm, this is surprising. Could you share your full code? That will make it easier to understand what is happening here. Also, I'm not seeing the actual error in your log. Does that mean that the error indeed happens at this line?

-> 3682 cluster_topics = [cluster_names[topic + self._outliers] for topic in documents.Topic.values]

I have the same error.

@Bougeant Could you also share your code and error log? That would help me understand what is happening here.

Sure! Here goes:

pip install bertopic==0.16.1 datasets

import logging
import pandas as pd
import spacy
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic
from bertopic.representation import PartOfSpeech
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

spacy.cli.download("en_core_web_md")

data = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))
df = pd.DataFrame({"text": data['data'], "target": data['target']})
df = df.drop_duplicates(subset=["text"]).reset_index(drop=True)
classes = {i: data["target_names"][i] for i in range(len(data["target_names"]))}
df["target"] = df["target"].map(classes)

model_params = {
    "embedding_model": SentenceTransformer("all-MiniLM-L6-v2"),
    "calculate_probabilities": True,
    "representation_model": PartOfSpeech(model="en_core_web_md", top_n_words=20, pos_patterns=[[{"POS": "NOUN"}]]),
    "min_topic_size": 100,
    "nr_topics": 20,
    "zeroshot_topic_list": ["baseball", "hockey", "space", "medecine", "encryption", "middle-east politics", "cars", "motorcycle", "electronics", "computers", "religion"],
    "zeroshot_min_similarity": 0.5
}

topic_model = BERTopic(**model_params)
embeddings = topic_model.embedding_model.encode(df["text"], show_progress_bar=True)
topic_model.fit(df["text"].to_list(), embeddings)

This is the error I get:

cluster_topics = [cluster_names[topic + self._outliers] for topic in documents.Topic.values]
--> IndexError: list index out of range

It seems that the error comes from the fact that cluster_names does not include the outlier cluster, while the lookup still shifts every topic by self._outliers, so the last index is out of range (with the values below, we try to get the 15th element of a 14-element list):

cluster_names = ['0_game_team_year_games', '1_health_patients_doctor_treatment', '2_car_bike_one_engine', '3_use_windows_one_system', '4_people_one_children_up', '5_people_arabs_one_peace', '6_health_mail_list_newsgroup', '7_space_launch_earth_orbit', '8_key_clipper_chip_encryption', '9_gay_people_sex_men', '10_post_people_one_flame', '11_one_will_people_christian', '12_fire_compound_children_people', '13_gun_guns_firearms_people']
topic = 13
self._outliers = 1
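
To make the off-by-one easier to see, here is a minimal sketch that replays only the indexing step with the values printed above. It does not use BERTopic at all; `outliers` is a plain variable standing in for `self._outliers`, and `index` is a name introduced just for this illustration.

```python
# Labels copied from the report above: 14 entries, none of them for topic -1.
cluster_names = [
    "0_game_team_year_games",
    "1_health_patients_doctor_treatment",
    "2_car_bike_one_engine",
    "3_use_windows_one_system",
    "4_people_one_children_up",
    "5_people_arabs_one_peace",
    "6_health_mail_list_newsgroup",
    "7_space_launch_earth_orbit",
    "8_key_clipper_chip_encryption",
    "9_gay_people_sex_men",
    "10_post_people_one_flame",
    "11_one_will_people_christian",
    "12_fire_compound_children_people",
    "13_gun_guns_firearms_people",
]
outliers = 1  # stands in for self._outliers
topic = 13    # highest topic id among the clustered documents

# Same arithmetic as the failing list comprehension in the traceback.
index = topic + outliers

try:
    cluster_names[index]
except IndexError as exc:
    print(f"index {index} on a list of {len(cluster_names)} labels: {exc}")

# With this list, an outlier document (topic -1) would land on index 0,
# which is the label of topic 0 rather than an outlier label.
print(cluster_names[-1 + outliers])
```

The shift by `outliers` only works if the list starts with an outlier label, which is not the case for the list shown here.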

Hi,

I am having the same issue (zero shot topic modelling crashes at the exact same line).

The code:

representation_model = KeyBERTInspired()
vectorizer_model = CountVectorizer(
    ngram_range=(1, 2), stop_words="english", min_df=30
)
embedding_model = "all-MiniLM-L6-v2"
topic_model = BERTopic(
    verbose=True,
    embedding_model=embedding_model,
    min_topic_size=50,
    calculate_probabilities=True,
    low_memory=True,
    representation_model=representation_model,
    zeroshot_topic_list=labels,
    zeroshot_min_similarity=0.5,
    language="english",
    n_gram_range=(1, 2),
)
topics, probs = topic_model.fit_transform(articles["abstract"].tolist())

I have printed out the following variables before the crash:

len(cluster_names): 78
np.max(documents.Topic.values): 77
np.min(documents.Topic.values): -1
self._outliers: 1
len(set(y)): 13 (which is also equal to len(labels), the number of input zero-shot labels)

In other words, the issue is the same as that reported by @Bougeant.

sorry a bit late, but this is my code

from bertopic import BERTopic
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer

data = load_dataset("HuggingFaceH4/h4_10k_prompts_ranked_gen")
docs = data["train_gen"]["prompt"]

zeroshot_topic_list = ['searching knowledge', 'answer coding problem', 'summarizing', 'rephrasing', 'roleplay', 'translate', 'generate content']
vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(
    min_topic_size=20,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.25,
    vectorizer_model=vectorizer_model
)

topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()

I'm running this in a Kaggle notebook, and I think I missed the last line of the error. This is the full screenshot:

[image: screenshot of the full error]

accidentally closed the issue, sorry

I've gotten around the problem with the following patch: master...lucasgautheron:BERTopic:patch-1

This is probably not the way you want to actually fix it, but I thought I should share

Thank you all for sharing the code! In all honesty, I'm not entirely sure why it suddenly seems to ignore outliers as the topic label should exist...

Either way, I think I managed to create a fix, but it still has to pass all the tests. Also, seeing as the tests didn't cover this specific issue, could anyone facing it test whether this fix works for them? I would feel a lot more confident that I have addressed this issue if it resolves it for more people than just on my machine.

Here's the PR: #1957

@lucasgautheron @andiwinata @Bougeant If you have the time, could you check whether #1957 works?

Hi! Any updates on that? This is a big blocker in my project right now.

@mzhadigerov Have you tested the PR I linked in my comment above? If that works for you and also for others, then I can go ahead and create a new release. Until then, please check out the PR.

@MaartenGr Thanks! It is working on my side. I cloned from fix_1946 branch.

image

@MaartenGr but my Representative_Docs of topic -1 are NaN for some reason, even though Count shows 424

@mzhadigerov The representative documents are not merged since they are essentially random documents when it concerns topic -1. Topic -1 consists of outliers that do not fall into a single group so the resulting documents are not actually related to one another.

I think it could be done to add representative documents there but in all honesty, I'm not sure it is worth the effort.

@MaartenGr Alright, if it is supposed to work like that (I don't use the Representative_Docs of topic -1 anyway).

I made the comment because the Rep.Docs of -1 are not NaN in v0.16.0

@mzhadigerov Thanks for sharing. It is currently low priority but I might bump it if it's important to many users.

For everyone facing this issue in 0.16.1, I just pushed an official 0.16.2 release which has the PR I mentioned earlier implemented. There are a bunch of open PRs with a lot of interesting stuff that I will look through in the upcoming weeks. For now, this issue should be resolved.
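
For anyone unsure which release they have installed, a small check along these lines might help. It is only a sketch: `parse_version` and `is_affected` are hypothetical helper names, and it flags 0.16.1 alone, since that is the only release reported as broken in this thread (0.16.0 was reported to work and 0.16.2 carries the fix).

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn a string such as '0.16.1' into a tuple of its first three numbers."""
    return tuple(int(number) for number in re.findall(r"\d+", text)[:3])


def is_affected(installed):
    """True only for 0.16.1, the release reported as broken in this thread."""
    return parse_version(installed) == (0, 16, 1)


try:
    installed = version("bertopic")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("bertopic is not installed in this environment")
elif is_affected(installed):
    print(f"bertopic {installed} is the release with the zero-shot IndexError; try 0.16.2")
else:
    print(f"bertopic {installed} is not the release reported in this issue")
```

If the check flags the install, `pip install bertopic==0.16.2` is the release mentioned above.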

Thank you for the super quick patch; I could not try it yet, but it looks equivalent to my quickfix so I assume it works.