MaartenGr/BERTopic

self._c_tf_idf can make more efficient use of vectorizer model


Hi Maarten,
When running merge_topics I noticed that it could sometimes be slow for large datasets, so I started looking into the source code for c-TF-IDF. I found that the vectorizer_model (at least in the case of CountVectorizer) takes most of the time. It is called in the self._c_tf_idf method:

        if partial_fit:
            # Online topic modeling: update the bag-of-words incrementally
            X = self.vectorizer_model.partial_fit(documents).update_bow(documents)
        elif fit:
            # Full fit followed by a separate transform over the same documents
            self.vectorizer_model.fit(documents)
            X = self.vectorizer_model.transform(documents)
        else:
            X = self.vectorizer_model.transform(documents)

When the model is fit, or when merge_topics is run, it calls self._c_tf_idf, which in turn calls self.vectorizer_model.fit() and then X = self.vectorizer_model.transform(). However, calling self.vectorizer_model.fit_transform() instead is more efficient (it can be roughly twice as fast): source code

More specifically, in CountVectorizer, fit() simply calls fit_transform() and discards the result, so calling fit() and then transform() does the transform work twice; calling fit_transform() directly avoids the duplicated pass.
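
For reference, scikit-learn's CountVectorizer.fit is essentially a thin wrapper around fit_transform (a simplified sketch, paraphrased rather than the verbatim source):

        # Simplified sketch of CountVectorizer.fit: it already builds the full
        # document-term matrix via fit_transform, then throws it away.
        def fit(self, raw_documents):
            self.fit_transform(raw_documents)
            return self

So fit() followed by transform() tokenizes and counts the entire corpus twice, whereas fit_transform() does that work once and keeps the resulting matrix.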

So my proposal would be:

        if partial_fit:
            X = self.vectorizer_model.partial_fit(documents).update_bow(documents)
        elif fit:
            # Fit the vocabulary and build the document-term matrix in one pass
            X = self.vectorizer_model.fit_transform(documents)
        else:
            X = self.vectorizer_model.transform(documents)

Let me know if you agree or have any concerns. If you agree, tell me what test evidence you'd like to see, and I can open a pull request.
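
As a starting point for test evidence, a minimal sketch along these lines (the vectorizer settings are just illustrative) could verify that both code paths produce identical document-term matrices:

        from sklearn.datasets import fetch_20newsgroups
        from sklearn.feature_extraction.text import CountVectorizer

        docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

        # Current code path: fit, then a separate transform
        X_old = CountVectorizer(stop_words="english").fit(docs).transform(docs)

        # Proposed code path: a single fit_transform
        X_new = CountVectorizer(stop_words="english").fit_transform(docs)

        # Two sparse matrices are equal iff no entries differ
        assert X_old.shape == X_new.shape
        assert (X_old != X_new).nnz == 0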

Thanks for sharing this! Any speed-up is highly appreciated.

> However, calling self.vectorizer_model.fit_transform() is more efficient (can be roughly twice as efficient)

If you have the time, I would love to see some (perhaps existing) benchmarks on this! Twice as efficient seems rather large, but it would be a great speed-up.

Hi Maarten,

You can use the code below to reproduce my results. I increased the size of the dataset to get a sense of how this scales for somewhat larger datasets.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Multiply by 5 to create a larger dataset and get a more robust estimate
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data'] * 5

# Current method: fit, then transform
pre_vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
pre_vectorizer_model.fit(docs)
X = pre_vectorizer_model.transform(docs)

# Proposed method: fit_transform in a single pass
pre_vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
X = pre_vectorizer_model.fit_transform(docs)
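
To collect the timings, each variant can be wrapped like this (a minimal sketch using time.perf_counter, reusing docs from the snippet above):

import time

from sklearn.feature_extraction.text import CountVectorizer

def timed(label, fn):
    # Run fn once and report the wall-clock time in seconds
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.1f}s")
    return result

# docs is the enlarged 20 newsgroups corpus from the snippet above
timed("fit + transform", lambda: CountVectorizer(
    ngram_range=(1, 2), stop_words="english").fit(docs).transform(docs))
timed("fit_transform", lambda: CountVectorizer(
    ngram_range=(1, 2), stop_words="english").fit_transform(docs))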

[Screenshot: timing comparison of fit + transform vs. fit_transform]

Awesome, thanks for testing this! As soon as the PR passes, I'll make sure to merge this.