
Would you be open to a lightweight vectorizer/embedder?


The datasets that I have tend to be ~80K examples and just running the embeddings on a CPU takes ~40 minutes.

I have, however, a trick up my sleeve.

from sklearn.pipeline import make_pipeline, make_union
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

pipe = make_pipeline(
    # Union of differently sized hash spaces: the "bloom" trick.
    make_union(
        HashingVectorizer(n_features=10_000),
        HashingVectorizer(n_features=9_000),
        HashingVectorizer(n_features=8_000)
    ),
    # Reweight the hashed counts, then project down to 100 dense dimensions.
    TfidfTransformer(),
    TruncatedSVD(100)
)
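
As a minimal usage sketch (assuming docs is a list of raw text strings), the pipeline yields a dense 100-dimensional embedding per document:

# Hypothetical usage; docs is any list of raw text documents.
embeddings = pipe.fit_transform(docs)
print(embeddings.shape)  # (len(docs), 100); dense thanks to TruncatedSVD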

This pipeline combines a hashing trick with a bloom hack, a tf-idf trick, and a sparse-PCA-style trick. The bloom hack is the union of differently sized hash spaces: two tokens that collide in one space are unlikely to collide in all of the others. One benefit is that this is orders of magnitude faster to embed, even when you include training.

import perfplot
from sentence_transformers import SentenceTransformer

# docs is assumed to be a list of raw text documents loaded beforehand.
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
out = perfplot.show(
    setup=lambda n: docs[:n],
    kernels=[
        lambda a: pipe.fit_transform(a),
        lambda a: pipe.fit(a).transform(a),
        lambda a: sentence_model.encode(a)
    ],
    labels=["fit_transform", "fit_and_transform", "sentence_transformer"],
    n_range=[100, 200, 500, 1000, 2000, 5000, 10_000],
    xlabel="len(a)",
    equality_check=None  # outputs have different shapes, so skip equality checks
)

[Figure: perfplot timing chart comparing the three kernels above]

It's just orders of magnitude faster. So maybe it'd be nice to have these embeddings around?

But what about the quality of the embeddings?

Mileage may vary, sure, but I have some results here that suggest it's certainly not the worst idea either. When you compare the UMAP chart on top of tf-idf with the Universal Sentence Encoder one, then sure, the USE variant is intuitively better, but given the speedup I'd argue that the tf-idf approach is reasonable too.

There's a fair bit of tuning involved, and I'm contemplating a library that implements bloom vectorizers properly for scikit-learn. But once that is done and once I've done some benchmarking, would this library be receptive to such an embedder?

The datasets that I have tend to be ~80K examples and just running the embeddings on a CPU takes ~40 minutes.

Yep, you can pass custom embeddings to .fit(docs, embeddings=embeddings), but it would be much nicer to have a dedicated backend that properly supports CPU. Although it is great that there are so many pre-trained models out there, needing a GPU is a bit of a shame as it raises the barrier to entry quite a bit for those who do not have one.
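
For reference, a minimal sketch of that workaround, assuming docs and the pipe from earlier:

from bertopic import BERTopic

# Precompute embeddings with the scikit-learn pipeline, then hand them to BERTopic.
embeddings = pipe.fit_transform(docs)
topic_model = BERTopic().fit(docs, embeddings=embeddings)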

This pipeline combines a hashing trick with a bloom hack, a tf-idf trick, and a sparse-PCA-style trick. One benefit is that this is orders of magnitude faster to embed, even when you include training.

Just to be sure, the bloom hack that you refer to is this one, right?

Mileage may vary, sure, but I have some results here that suggest it's certainly not the worst idea either. When you compare the UMAP chart on top of tf-idf with the Universal Sentence Encoder one, then sure, the USE variant is intuitively better, but given the speedup I'd argue that the tf-idf approach is reasonable too.
It's just orders of magnitude faster. So maybe it'd be nice to have these embeddings around?

Definitely! I think this would appeal to those wanting a faster, CPU-based approach. Moreover, seeing as BERTopic opts for as much modularity as possible, it makes sense to also offer options that focus on speed whilst still producing good enough results.

There's a fair bit of tuning involved, and I'm contemplating a library that implements bloom vectorizers properly for scikit-learn.
But once that is done and once I've done some benchmarking, would this library be receptive to such an embedder?

Yes, I'm imagining something like this, if it makes sense to generalize it to any scikit-learn pipeline:

from bertopic.backend import BaseEmbedder
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

class SklearnEmbedder(BaseEmbedder):
    def __init__(self, pipe):
        super().__init__()
        self.pipe = pipe

    def embed(self, documents, verbose=False):
        # Re-use the pipeline if it was already fitted; otherwise fit it on the fly.
        try:
            check_is_fitted(self.pipe)
            embeddings = self.pipe.transform(documents)
        except NotFittedError:
            embeddings = self.pipe.fit_transform(documents)

        return embeddings

custom_embedder = SklearnEmbedder(pipe)
topic_model = BERTopic(embedding_model=custom_embedder)

Just tried the above on the 20 NewsGroups dataset and, subjectively evaluating the output, I am quite impressed with how similar the results seem compared to something like SentenceTransformer("all-MiniLM-L6-v2").
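
For anyone who wants to reproduce that quick trial, a sketch along these lines should do (my reconstruction, not the exact code that was run):

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Load the raw 20 Newsgroups texts and fit BERTopic with the scikit-learn backend.
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]
topic_model = BERTopic(embedding_model=SklearnEmbedder(pipe))
topics, probs = topic_model.fit_transform(docs)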

There is one thing to note though. BERTopic was initially built around pre-trained language models, meaning that it was assumed that those models would only need a single method (.encode(docs), .embed(docs), etc.) in order to generate the embeddings. Since we now have a fit/transform combo, this does not work out-of-the-box for BERTopic's .partial_fit method. Similarly, it was assumed that generating embeddings with the language models would work for both document and word embeddings.

Those are definitely not breaking issues, and most features will run without any problems, but it is something to take into account when opting for this backend.

If you remove the TruncatedSVD and the TfidfTransformer, it's all partial_fit compatible again. Also, I'm realising it's even possible to make pre-trained variants of this idea via scikit-partial.
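
A minimal sketch of that stripped-down variant (same hash sizes as before): HashingVectorizer is stateless, so the remaining union needs no fitting at all and works batch-by-batch.

from sklearn.pipeline import make_union
from sklearn.feature_extraction.text import HashingVectorizer

# Every step here hashes on the fly, so no .fit call is required
# and any incoming batch of documents can be transformed directly.
online_pipe = make_union(
    HashingVectorizer(n_features=10_000),
    HashingVectorizer(n_features=9_000),
    HashingVectorizer(n_features=8_000)
)
sparse_embeddings = online_pipe.transform(["an example document"])

The trade-off is that the output stays sparse and high-dimensional, since the TruncatedSVD step that densified it is gone.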

The baby has a big bad diaper now. But I'll come back to this in due time.

How about this, to keep things simple: I'll make a PR for this feature with a big segment in the docs that explains some of the caveats?

Sure, that would be great!