Small data exploration project.
Current Features:
- Explore hashtag usage over time
- Some fun interactive UMAP Vis
- [TODO] Explore Tweet similarities over time (VSM)
- [TODO] Explore Tweet similarities over time (Embeddings)
- [TODO] Some fun AlignedUMAP Animation
For clean execution, ignore the rest of the repository and start with the pipeline folder.
The other code is exploratory and may no longer be compatible. utils is mostly maintained and is used in the pipeline.
Note: May require Python 3.9+
Using tSNE with
For speed, use k-means. It also gives direct control over the number of clusters, and no tweets end up as outliers.
It has its own shortcomings, though. To mitigate them, FrankenTopic has a mode where clusters with too few tweets are dropped.
An intuitive setting for that minimum is roughly (num_tweets / num_clusters) / 2, i.e. half the average cluster size.
See settings min_docs_per_topic and max_n_topics.
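The rule of thumb above can be sketched in a few lines of plain Python. This is only an illustration, not FrankenTopic's actual implementation: the helper names `min_docs_threshold` and `drop_small_clusters` are hypothetical, and marking dropped tweets with the label -1 is an assumption borrowed from the usual outlier convention.

```python
from collections import Counter


def min_docs_threshold(num_tweets: int, num_clusters: int) -> int:
    """Half the average cluster size: (num_tweets / num_clusters) / 2."""
    return max(1, int((num_tweets / num_clusters) / 2))


def drop_small_clusters(labels, min_docs):
    """Relabel tweets in clusters smaller than min_docs as -1 (assumed marker)."""
    sizes = Counter(labels)
    return [lab if sizes[lab] >= min_docs else -1 for lab in labels]


# Toy cluster labels, as a k-means run might assign them: 12 tweets, 3 clusters.
labels = [0] * 7 + [1] * 4 + [2] * 1
threshold = min_docs_threshold(len(labels), num_clusters=3)
kept = drop_small_clusters(labels, threshold)
print(threshold, kept)
```

In this sketch the computed threshold would presumably be the value passed as min_docs_per_topic, with the number of clusters bounded by max_n_topics; check the pipeline code for how the settings are actually interpreted.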
Very informative documentation about how to pick HDBSCAN parameters here!
Check out HDBSCAN Soft Clustering; maybe we can use it to simulate cluster affinity or even some sort of topic distribution. After all, the "H" stands for hierarchical, and soft clustering provides a score of how well a point fits each cluster (i.e. topic).
Small playground to get a feel for streamlit
# To start, run
cd hashtags/
streamlit run hashtags.py
Open http://localhost:8501/ (or whatever port is assigned) to see the following: