🔖 ru-BERTopic

Using BERT embeddings with c-TF-IDF for topic modeling in the Russian language
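To make the c-TF-IDF part concrete, here is a minimal sketch in plain Python. It assumes the documents are already tokenized and lemmatized and that each one already has a topic label (for example from clustering its BERT embedding). The weighting follows the formulation given in the BERTopic documentation, term frequency within a topic times `log(1 + A / f_t)`, which may differ from what the notebook actually computes. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(docs_tokens, labels):
    """Return {topic: {term: score}} using a class-based TF-IDF.

    docs_tokens: list of token lists (assumed tokenized and lemmatized)
    labels: one topic id per document (assumed to come from clustering)
    """
    # Merge all documents of a topic into a single bag of words.
    per_topic = {}
    for tokens, label in zip(docs_tokens, labels):
        per_topic.setdefault(label, Counter()).update(tokens)

    # Frequency of each term across all topics.
    total = Counter()
    for counts in per_topic.values():
        total.update(counts)

    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(per_topic)

    scores = {}
    for label, counts in per_topic.items():
        n_words = sum(counts.values())
        scores[label] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counts.items()
        }
    return scores


def top_terms(scores, topic, k=3):
    """Return the k highest-scoring terms of a topic (ties broken by term)."""
    ranked = sorted(scores[topic].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy example: two "sports" documents and two "finance" documents.
docs = [
    ["матч", "гол", "команда"],
    ["команда", "тренер", "матч"],
    ["рынок", "акция", "банк"],
    ["банк", "ставка", "рынок"],
]
labels = [0, 0, 1, 1]
scores = c_tf_idf(docs, labels)
print(top_terms(scores, 0, k=2))
print(top_terms(scores, 1, k=2))
```

Terms that are frequent inside one topic and rare elsewhere get the highest scores, which is what makes them usable as topic descriptions.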

Getting started

Computing embeddings, tokenizing, and lemmatizing large amounts of text takes some time. If you just want to see how it works, you can download an already preprocessed dataset of Russian news here -> Google Drive.
Alternatively, scroll down to the end of the notebook below and play with the visualisation.

| Name | Link |
| --- | --- |
| ru-BERTopic main notebook | Open In Colab |

Visualisation

The visualisation is implemented with Plotly and provides semi-interactive figures. You can play with it in the notebook above. Each point represents a single document, and hovering over it reveals the document's topic and id so you can dig deeper.
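A hover behaviour like this could be sketched with `plotly.graph_objects` roughly as follows. This is an illustration rather than the notebook's actual code: it assumes every document already has 2-D coordinates (for example from a dimensionality reduction step), and the helper names `hover_labels` and `build_figure` are hypothetical.

```python
def hover_labels(doc_ids, topics):
    """Build one hover string per document, showing its topic and id."""
    return [f"topic: {t}<br>id: {i}" for i, t in zip(doc_ids, topics)]


def build_figure(xy, doc_ids, topics):
    """Scatter plot with one point per document, coloured by topic.

    xy: list of (x, y) pairs, assumed to be precomputed 2-D coordinates
    """
    # Imported here so the helper above can be used without Plotly installed.
    import plotly.graph_objects as go

    return go.Figure(
        go.Scatter(
            x=[p[0] for p in xy],
            y=[p[1] for p in xy],
            mode="markers",
            marker=dict(color=topics, colorscale="Viridis"),
            text=hover_labels(doc_ids, topics),
            hoverinfo="text",  # show only the custom hover string
        )
    )


# Toy usage with three documents in two topics:
# fig = build_figure([(0.1, 0.2), (0.3, 0.1), (2.0, 2.2)], [10, 11, 12], [0, 0, 1])
# fig.show()
```

Putting the document id in the hover text makes it easy to look up the original text of any point that seems misplaced.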