This project was my first foray into natural language processing and topic modeling, using Joe Biden's 2020 election tweets. The tweets cover an 18-month period from his declaration of candidacy on April 25, 2019 through October 31, 2020, right before the general election. The goal was to find topic trends over the course of the campaign, especially between the primary and the general election and through the COVID-19 pandemic.
- Found a Kaggle data set of Joe Biden's tweets.
- NLP preprocessing:
  - removal of apostrophes, special characters, non-English words, etc.
  - lemmatization
  - removal of stop words, min_df and max_df set
- TF-IDF vectorization
- NMF topic modeling and optimization
- Visualization
  - Wordcloud
  - K-means clustering
  - Document Type similarity with PCA and t-SNE
I found 10 fairly distinct topics in the tweets; the two most frequent were about Trump and healthcare. The 2020 election was a repudiation of Donald Trump, so it makes sense that many tweets would be about him. A common view is that healthcare, and defending the Affordable Care Act in particular, was a big reason Democrats won back the House of Representatives in 2018, so it's no surprise that it continued to be the biggest policy issue in 2020, especially during the COVID-19 pandemic.
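
One way to arrive at "most occurring topics" and to compare the primary with the general election is to assign each tweet its highest-weighted topic and count by campaign phase. The sketch below uses made-up topic weights and dates; the April 8, 2020 cutoff between phases is an assumption for illustration, not something taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights (rows: tweets, columns: topics),
# standing in for the output of NMF's fit_transform.
doc_topic = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.0, 0.2, 0.8],
    [0.7, 0.2, 0.1],
])

# Hypothetical tweet dates inside the campaign window.
dates = pd.to_datetime([
    "2019-05-01", "2019-09-15", "2020-02-10",
    "2020-05-20", "2020-08-01", "2020-10-15",
])

# Dominant topic = index of the largest weight in each row.
df = pd.DataFrame({"date": dates, "topic": doc_topic.argmax(axis=1)})

# Assumed cutoff separating the primary from the general election phase.
cutoff = pd.Timestamp("2020-04-08")
df["phase"] = np.where(df["date"] < cutoff, "primary", "general")

# Overall frequency of each dominant topic.
topic_counts = df["topic"].value_counts()

# Share of tweets per topic within each phase (each row sums to 1).
phase_share = pd.crosstab(df["phase"], df["topic"], normalize="index")

print(topic_counts)
print(phase_share.round(2))
```

Using the single dominant topic is a simplification: NMF gives each tweet a weight for every topic, so summing the weights per topic is an alternative that keeps mixed-topic tweets from being forced into one bucket.
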
- Jupyter Notebook
- Python
- Pandas
- Matplotlib
- Wordcloud
- NLTK
- Scikit-learn
  - TF-IDF vectorizer
  - K-Means clustering
  - Principal Component Analysis (PCA)
  - t-SNE
- Natural Language Processing
- Unsupervised Learning
- Dimensionality Reduction
- Topic Modeling
- Clustering
- Visualization