SimpleTopicModel

Easily identifying themes in text

Primary language: Jupyter Notebook. License: GNU General Public License v3.0 (GPL-3.0).

What is this?

SimpleTopicModel is a Python package that wraps up common theme identification (topic modeling) techniques. It is currently under development and subject to change.

How do I get it?

For now, you can git clone this repo and import it locally. Be sure to run pip install -r requirements.txt in the repo folder so that the required dependencies are installed.

I'm working on setting up a PyPI release, slated for the near future.

How do I use it?

Use couldn't be easier. Most topic modeling techniques follow the same paradigm:

  1. Convert your text to numbers (embeddings): The excellent Sentence-Transformers package does this for us, using Microsoft's MiniLM model.
  2. Reduce Dimensionality: This package uses UMAP, but you could substitute t-SNE or PCA if you wanted to.
  3. Cluster: We're using HDBSCAN to build hierarchical clusters (which we'd like to traverse in a later release), but you could also use k-means, a GMM, etc.
  4. Visualize (Optional): This displays the reduced-dimension embeddings in 3D (or 2D) space, so you can get a feel for how "tight" the clusters are.
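
To make the paradigm concrete, here is a minimal sketch of steps 1 to 3 written directly against the underlying libraries. It is not SimpleTopicModel's own API: the function names (embed_texts, reduce_dimensions, cluster_points, find_topics) are hypothetical, and the model name "all-MiniLM-L6-v2" and the parameter values are assumptions chosen for illustration. Each step is a plain function, so swapping in PCA or another clusterer only means passing a different callable. The optional visualization step is left out.

```python
from collections import defaultdict


def embed_texts(texts, model_name="all-MiniLM-L6-v2"):
    """Step 1: convert texts to embeddings (model name is an assumption)."""
    from sentence_transformers import SentenceTransformer  # imported lazily
    return SentenceTransformer(model_name).encode(texts)


def reduce_dimensions(embeddings, n_components=3):
    """Step 2: project the embeddings down to a few dimensions with UMAP."""
    import umap  # provided by the umap-learn package
    reducer = umap.UMAP(n_components=n_components, metric="cosine", random_state=42)
    return reducer.fit_transform(embeddings)


def cluster_points(points, min_cluster_size=5):
    """Step 3: cluster the reduced points with HDBSCAN."""
    import hdbscan
    return hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(points)


def find_topics(texts, embed=embed_texts, reduce=reduce_dimensions, cluster=cluster_points):
    """Run embed -> reduce -> cluster and group the texts by cluster label."""
    labels = cluster(reduce(embed(texts)))
    topics = defaultdict(list)
    for text, label in zip(texts, labels):
        topics[int(label)].append(text)
    return dict(topics)
```

Calling find_topics(list_of_documents) returns a dictionary that maps each cluster label to the documents assigned to it. HDBSCAN uses the label -1 for points it treats as noise, so that key collects documents that did not fit any cluster. UMAP and HDBSCAN both need a reasonable number of documents to work with, so this is meant for a real corpus rather than a handful of sentences.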

What's next?

  • Clean up/professionalize this repo & releases
  • Add automated cluster naming techniques (c-TF-IDF, LLM-assisted naming, etc.)
  • Make a sweet logo & eye-catching graphics
  • Fix the docs page

Acknowledgements:

This builds on previous work including Gensim (LDA), BERTopic, Top2Vec, and pyLDAvis. They're all excellent, more mature alternatives to SimpleTopicModel, and I'd encourage you to go check them out!