Covid vs Climate on Twitter

Small data exploration project.

Current Features:

Explore hashtag usage over time
Some fun interactive UMAP Vis
[TODO] Explore Tweet similarities over time (VSM)
[TODO] Explore Tweet similarities over time (Embeddings)
[TODO] Some fun AlignedUMAP Animation

For a clean run, ignore the rest of the repository and start with the pipeline folder. The other code is exploratory and may no longer be compatible. utils is mostly maintained and is used by the pipeline.

Note: May require Python 3.9+

FrankenTopic aka BERTopic reloaded

UMAP vs tSNE

Uses tSNE with $\alpha$<1 (inspired by Dmitry Kobak) instead of UMAP. This improves the clusterability of the two-dimensional projection, since UMAP apparently just creates a big blob in the middle that forms a single large cluster.
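
As a rough illustration of what $\alpha$<1 changes, the sketch below assumes the setting refers to the heavy-tailed similarity kernel from Kobak et al., k(d) = (1 + d^2/alpha)^(-alpha), where alpha=1 gives the standard tSNE kernel. The function name is made up for this example and is not part of FrankenTopic; the actual projection code may set this differently.

```python
def heavy_tailed_kernel(distance: float, alpha: float = 1.0) -> float:
    """Low-dimensional similarity k(d) = (1 + d^2 / alpha) ** (-alpha).

    alpha = 1 is the standard tSNE (Cauchy) kernel; alpha < 1 has heavier tails.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return (1.0 + distance ** 2 / alpha) ** (-alpha)


# Compare the standard kernel with a heavier-tailed one at a few distances.
for d in (0.5, 1.0, 3.0, 10.0):
    standard = heavy_tailed_kernel(d, alpha=1.0)
    heavy = heavy_tailed_kernel(d, alpha=0.5)
    print(f"d={d:>4}: alpha=1 -> {standard:.4f}, alpha=0.5 -> {heavy:.4f}")
```

With alpha=0.5 the similarity decays more slowly with distance than with alpha=1. According to Kobak et al., this kind of kernel tends to pull out finer cluster structure in the embedding.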

HDBSCAN vs kMeans

kMeans is faster, gives direct control over the number of clusters, and leaves no tweets as outliers, but it has its own shortcomings. To mitigate them, FrankenTopic has a mode in which clusters with too few tweets are dropped. An intuitive threshold is about (num_tweets/num_clusters)/2.
See settings min_docs_per_topic and max_n_topics.
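
A minimal sketch of the threshold heuristic, not FrankenTopic's actual code. It assumes that dropped clusters are relabelled as outliers; the function name and the outlier_label parameter are hypothetical, and only min_docs_per_topic comes from the settings above.

```python
from collections import Counter


def drop_small_clusters(labels, min_docs_per_topic=None, outlier_label=-1):
    """Relabel members of clusters with too few documents as outliers."""
    if not labels:
        return []
    counts = Counter(labels)
    if min_docs_per_topic is None:
        # Heuristic from the text: half the average cluster size.
        min_docs_per_topic = (len(labels) / len(counts)) / 2
    kept = {cluster for cluster, n in counts.items() if n >= min_docs_per_topic}
    return [c if c in kept else outlier_label for c in labels]


# Example: kMeans-style assignments for 20 tweets in 3 clusters.
labels = [0] * 10 + [1] * 8 + [2] * 2
print(drop_small_clusters(labels))
```

Here the default threshold is (20/3)/2, roughly 3.3, so the two tweets in cluster 2 are relabelled and the other two clusters are kept.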

The HDBSCAN documentation has a very informative section on how to pick its parameters.

Notes for the future

Check out HDBSCAN Soft Clustering. Maybe we can use it to simulate cluster affinity or even some sort of topic distribution: after all, the "H" stands for hierarchical, and soft clustering provides a score of how well something fits a cluster, i.e. a topic.

Hashtags

Small playground to get a feel for streamlit

# To start, run
cd hashtags/
streamlit run hashtags.py