BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
BERTopic supports guided, (semi-) supervised, and dynamic topic modeling. It even supports visualizations similar to LDAvis!
Corresponding Medium posts can be found here and here. For a more detailed overview, you can read the paper.
BERTopic, together with sentence-transformers, can be installed from PyPI:
pip install bertopic
Depending on the transformer and language backends you plan to use, you may want to install additional dependencies. The available extras are:
pip install bertopic[flair]
pip install bertopic[gensim]
pip install bertopic[spacy]
pip install bertopic[use]
For an in-depth overview of the features of BERTopic you can check the full documentation here or you can follow along with one of the examples below:
We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
After generating topics and their probabilities, we can access the most frequent topics that were generated:
>>> topic_model.get_topic_info()
Topic  Count  Name
-1     4630   -1_can_your_will_any
0      693    49_windows_drive_dos_file
1      466    32_jesus_bible_christian_faith
2      441    2_space_launch_orbit_lunar
3      381    22_key_encryption_keys_encrypted
Topic -1 contains all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 0:
>>> topic_model.get_topic(0)
[('windows', 0.006152228076250982),
('drive', 0.004982897610645755),
('dos', 0.004845038866360651),
('file', 0.004140142872194834),
('disk', 0.004131678774810884),
('mac', 0.003624848635985097),
('memory', 0.0034840976976789903),
('software', 0.0034415334250699077),
('email', 0.0034239554442333257),
('pc', 0.003047105930670237)]
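The numbers next to each word are c-TF-IDF weights. As a rough illustration of where such weights come from, here is a minimal pure-Python sketch of the class-based TF-IDF formula described in the paper, W(t, c) = tf(t, c) * log(1 + A / tf(t)), where A is the average number of words per topic. The function name `c_tf_idf` and the toy data are assumptions made for this example; BERTopic's own implementation differs in details such as normalization and vectorization.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score terms per topic.

    docs_per_topic maps a topic id to a list of tokenized documents
    (each document is a list of strings). Hypothetical helper, not part
    of the BERTopic API.
    """
    # Treat all documents of a topic as one long document and count its terms
    tf = {
        topic: Counter(token for doc in docs for token in doc)
        for topic, docs in docs_per_topic.items()
    }

    # Frequency of each term across all topics
    total = Counter()
    for counts in tf.values():
        total.update(counts)

    # A: average number of words per topic
    avg_words = sum(total.values()) / len(tf)

    # W(t, c) = tf(t, c) * log(1 + A / tf(t))
    return {
        topic: {
            term: count * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
        for topic, counts in tf.items()
    }


# Toy data: two topics with two short tokenized documents each
toy_docs = {
    0: [["windows", "drive", "file"], ["windows", "dos"]],
    1: [["space", "launch"], ["space", "orbit", "file"]],
}
scores = c_tf_idf(toy_docs)
best_word_topic_0 = max(scores[0], key=scores[0].get)
```

In this toy example, "file" appears in both topics, so it receives a lower weight in topic 0 than "drive", which appears only there.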
NOTE: Use BERTopic(language="multilingual") to select a model that supports 50+ languages.
After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can visualize the topics that were generated in a way very similar to LDAvis:
topic_model.visualize_topics()
We can create an overview of the most frequent topics in a way that they are easily interpretable. Horizontal barcharts typically convey information rather well and allow for an intuitive representation of the topics:
topic_model.visualize_barchart()
Find all possible visualizations with interactive examples in the documentation here.
BERTopic supports many embedding models that can be used to embed the documents and words:
- Sentence-Transformers
- Flair
- Spacy
- Gensim
- USE
Sentence-Transformers is typically used as it has shown great results embedding documents meant for semantic similarity. Simply select any model from their documentation here and pass it to BERTopic:
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
Flair allows you to choose almost any 🤗 transformers model. Simply select any model from here and pass it to BERTopic:
from flair.embeddings import TransformerDocumentEmbeddings
roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)
Click here for a full overview of all supported embedding models.
Dynamic topic modeling (DTM) is a collection of techniques aimed at analyzing the evolution of topics over time. These methods allow you to understand how a topic is represented at different points in time. Here, we will be using all of Donald Trump's tweets to see how he talked about certain topics over time:
import re
import pandas as pd
trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
timestamps = trump.date.to_list()
tweets = trump.text.to_list()
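To make the three cleaning steps above easier to follow, here is a minimal sketch that applies the same regular expressions to a single string. The function name `clean_tweet` and the sample tweet are made up for illustration.

```python
import re


def clean_tweet(text):
    """Apply the same cleaning steps as above to one tweet (hypothetical helper)."""
    # 1. Remove URLs and lowercase the text
    text = re.sub(r"http\S+", "", text).lower()
    # 2. Drop @-mentions
    text = " ".join(word for word in text.split() if not word.startswith("@"))
    # 3. Replace everything that is not a letter with a space and collapse whitespace
    text = " ".join(re.sub("[^a-zA-Z]+", " ", text).split())
    return text


cleaned = clean_tweet("Check this out https://t.co/abc @foo Great News!!! 2020")
print(cleaned)  # check this out great news
```

Tweets that consist only of links, mentions, or digits end up as empty strings, which is why empty texts are filtered out afterwards.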
Then, we need to extract the global topic representations by simply creating and training a BERTopic model:
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(tweets)
From these topics, we are going to generate the topic representations at each timestamp for each topic. We do this by simply calling topics_over_time and passing in his tweets, the corresponding timestamps, and the related topics:
topics_over_time = topic_model.topics_over_time(tweets, topics, timestamps, nr_bins=20)
Finally, we can visualize the topics by simply calling visualize_topics_over_time():
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=6)
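To give an intuition for what nr_bins does, the sketch below groups timestamps into equal-width bins and counts how often each topic occurs per bin. It is a simplified illustration only: the function `topic_counts_per_bin` is hypothetical, the equal-width binning is an assumption, and BERTopic additionally computes a topic representation for every bin rather than just counting documents.

```python
from collections import Counter
from datetime import datetime, timezone


def topic_counts_per_bin(topics, timestamps, nr_bins):
    """Count documents per (bin index, topic) using equal-width time bins."""
    times = [t.timestamp() for t in timestamps]
    start, end = min(times), max(times)
    # Fall back to a width of 1.0 when all timestamps are identical
    width = (end - start) / nr_bins or 1.0

    counts = Counter()
    for topic, t in zip(topics, times):
        # The latest timestamp is placed in the last bin instead of a bin of its own
        bin_index = min(int((t - start) / width), nr_bins - 1)
        counts[(bin_index, topic)] += 1
    return counts


# Toy data: four documents, two topics, spread over five days
toy_timestamps = [
    datetime(2020, 1, 1, tzinfo=timezone.utc),
    datetime(2020, 1, 2, tzinfo=timezone.utc),
    datetime(2020, 1, 3, tzinfo=timezone.utc),
    datetime(2020, 1, 5, tzinfo=timezone.utc),
]
toy_topics = [0, 0, 1, 1]
bin_counts = topic_counts_per_bin(toy_topics, toy_timestamps, nr_bins=2)
```

With two bins, both documents of topic 0 fall into the first bin and both documents of topic 1 into the second. Fewer bins give smoother trends, while more bins show finer detail at the cost of fewer documents per bin.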
For quick access to common functions, here is an overview of BERTopic's main methods:
Method | Code |
---|---|
Fit the model | .fit(docs) |
Fit the model and predict documents | .fit_transform(docs) |
Predict new documents | .transform([new_doc]) |
Access single topic | .get_topic(topic=12) |
Access all topics | .get_topics() |
Get topic freq | .get_topic_freq() |
Get all topic information | .get_topic_info() |
Get representative docs per topic | .get_representative_docs() |
Get topics per class | .topics_per_class(docs, topics, classes) |
Dynamic Topic Modeling | .topics_over_time(docs, topics, timestamps) |
Update topic representation | .update_topics(docs, topics, n_gram_range=(1, 3)) |
Reduce nr of topics | .reduce_topics(docs, topics, nr_topics=30) |
Find topics | .find_topics("vehicle") |
Save model | .save("my_model") |
Load model | BERTopic.load("my_model") |
Get parameters | .get_params() |
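As a small example of combining these methods, the sketch below uses .get_topics() to collect the top words of every topic while skipping the outlier topic -1. It assumes that .get_topics() returns a dictionary mapping each topic id to a list of (word, score) tuples, as in the .get_topic(0) output shown earlier. The helper `top_words_per_topic` and the stand-in class are hypothetical and only there to keep the example runnable without training a model; in practice you would pass a fitted BERTopic instance.

```python
def top_words_per_topic(topic_model, n_words=5):
    """Return {topic_id: [word, ...]} for every topic except the outlier topic -1."""
    summary = {}
    for topic_id, words in topic_model.get_topics().items():
        if topic_id == -1:
            continue  # -1 holds the outliers and is typically ignored
        summary[topic_id] = [word for word, _score in words[:n_words]]
    return summary


class FakeTopicModel:
    """Stand-in for a fitted model, exposing only get_topics() (illustration only)."""

    def get_topics(self):
        return {
            -1: [("can", 0.002), ("your", 0.002)],
            0: [("windows", 0.006), ("drive", 0.005), ("dos", 0.005)],
            1: [("jesus", 0.007), ("bible", 0.006)],
        }


summary = top_words_per_topic(FakeTopicModel(), n_words=2)
```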
Here is an overview of BERTopic's visualization methods:
Method | Code |
---|---|
Visualize Topics | .visualize_topics() |
Visualize Topic Hierarchy | .visualize_hierarchy() |
Visualize Topic Terms | .visualize_barchart() |
Visualize Topic Similarity | .visualize_heatmap() |
Visualize Term Score Decline | .visualize_term_rank() |
Visualize Topic Probability Distribution | .visualize_distribution(probs[0]) |
Visualize Topics over Time | .visualize_topics_over_time(topics_over_time) |
Visualize Topics per Class | .visualize_topics_per_class(topics_per_class) |
To cite the BERTopic paper, please use the following BibTeX reference:
@article{grootendorst2022bertopic,
title={BERTopic: Neural topic modeling with a class-based TF-IDF procedure},
author={Grootendorst, Maarten},
journal={arXiv preprint arXiv:2203.05794},
year={2022}
}