MaartenGr/BERTopic

using representation model takes much longer

Opened this issue · 1 comment

Hello, I am running BERTopic on a MacBook Pro M1 with the following parameters, using precomputed embeddings from a sentence transformer:

from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import OpenAI

vectorizer_model = CountVectorizer(stop_words="english")
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
representation_model = OpenAI(openai_client, model="gpt-3.5-turbo", delay_in_seconds=10, chat=True)

topic_model = BERTopic(
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    nr_topics="auto",
    min_topic_size=max(int(len(docs) / 800), 10),
    representation_model=representation_model,
)
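For completeness, the embeddings are precomputed roughly like this before fitting (the model name here is just an example of my setup):

from sentence_transformers import SentenceTransformer

# Precompute document embeddings once, outside BERTopic
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Pass the precomputed embeddings so BERTopic skips its own embedding step
topics, probs = topic_model.fit_transform(docs, embeddings)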

I noticed a big difference in the fitting time of the model when using the representation model: roughly 5 minutes without it and 35 minutes with it.

Is there a particular reason for that? Since the representation step should run at the end of the entire topic modeling process, it should not take 30 minutes to retrieve the keywords and documents and send a prompt to ChatGPT. Maybe there is something happening that I am not aware of. Thanks in advance.

Is there maybe a possibility to add the representation layer after fitting the model?

That's because you have set delay_in_seconds=10, which means that between each prompt there will be a delay of 10 seconds to prevent time-out/rate-limit errors. These generally appear when you have a free OpenAI account, as those accounts have a strict rate limit. If you want it to be faster, simply remove that parameter.
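As for applying the representation layer only after fitting, a minimal sketch, assuming a reasonably recent BERTopic release in which update_topics accepts a representation_model, would be to fit without the OpenAI representation and attach it afterwards:

from bertopic.representation import OpenAI

# Fit without the OpenAI representation first (no API calls, so this stays fast)
topic_model = BERTopic(
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    nr_topics="auto",
    min_topic_size=max(int(len(docs) / 800), 10),
)
topics, probs = topic_model.fit_transform(docs, embeddings)

# Then update only the topic representations on the fitted topics;
# leaving out delay_in_seconds avoids the forced 10-second pause per prompt
representation_model = OpenAI(openai_client, model="gpt-3.5-turbo", chat=True)
topic_model.update_topics(docs, representation_model=representation_model)

If rate limits on a free account are still a concern, newer versions of the OpenAI representation also expose an exponential_backoff option, as far as I know, which retries failed requests instead of pausing before every prompt.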