A quantitative study comprising Twitter discussions and thematic analysis for COVID-19 vaccines
Paper published at: https://infodemiology.jmir.org/2022/1/e33909
- Background
- Objective
- Tools and Packages
- Data Collection
- Data Pre-Processing
- Data Modeling
- Data Visualization
- Results
- Conclusion
- References
- Challenges and Future Work
As of April 30, 2021, the COVID-19 pandemic had killed 3.2 million people and infected 150 million around the world. Growing human rights concerns, vaccine movements, and skepticism towards the vaccines, their effects, and their efficacy have led to a multitude of conversations on social media and have made the vaccination process a complicated task. No major studies have been conducted to analyze people's perception of COVID-19 vaccines on social media for the year 2021.
Task | Technique | Tools/Packages Used |
---|---|---|
Data Collection | Tweet extraction from Twitter | snscrape |
Data Pre-processing | Removal of punctuation, stopwords, URLs, and emojis; lemmatization | re, nltk, CountVectorizer, pandas, numpy |
Data Modeling | Unsupervised LDA | pyLDAvis.sklearn, LatentDirichletAllocation, sklearn |
Text Analytics | Topic Modeling, Sentiment analysis | vaderSentiment, corextopic |
Data Visualization | Multi-attribute plots | matplotlib, seaborn, Tableau, wordcloud |
Environments & Platforms | | MS Excel, Google Colab, Jupyter Notebook, Twitter |
Method | Notes |
---|---|
Tweepy | Limited to 3,200 tweets; no historical data |
GetOldTweets3 | Twitter has removed the endpoint that GetOldTweets3 uses |
TWINT | Twitter imposes a stricter device and IP ban after a certain number of queries |
snscrape | Scraped 100K tweets, of which 96,641 were in English |
Octoparse (software) | Very time-consuming with the event loop |
Individual tweets = 2.1 million
Organizational tweets = 0.59 million
Data Cleaning
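The cleaning steps listed in the tools table (removing URLs, emojis, punctuation, and stopwords) could be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function name `clean_tweet`, the tiny stopword set, and the crude emoji handling (dropping every non-ASCII character) are all assumptions. The project lists `nltk` for stopwords and lemmatization; lemmatization is left out here to keep the sketch dependency-free.

```python
import re
import string

# Tiny illustrative stopword set (assumption); the project lists nltk for this step.
STOPWORDS = {
    "a", "an", "the", "is", "are", "was", "to", "of", "and",
    "in", "for", "on", "it", "this", "that", "i", "my",
}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Crude emoji removal (assumption): drops every non-ASCII character.
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text: str) -> str:
    """Return a lowercased tweet without URLs, emojis, punctuation, or stopwords."""
    text = URL_RE.sub(" ", text)          # strip links first, before punctuation goes
    text = NON_ASCII_RE.sub(" ", text)    # strip emojis and other non-ASCII symbols
    text = text.lower().translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


if __name__ == "__main__":
    print(clean_tweet("Got my vaccine today! 💉 https://t.co/abc #COVID19"))
```

URLs are removed before punctuation so that stripping `:` and `/` does not leave link fragments behind as tokens.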
Individual vs Organizational Tweets
To uncover the abstract topics hidden in the tweets, we applied unsupervised LDA using scikit-learn's LatentDirichletAllocation and explored the results with pyLDAvis. We discovered 18 distinct topics, chosen by considering cluster size and the absence of overlap among clusters.

Sentiment analysis is commonly framed as a supervised machine learning problem and comes in several varieties. We used fine-grained sentiment classification with five levels: overly positive, positive, neutral, negative, and overly negative. Rather than training a classifier, we used VADER (Valence Aware Dictionary and sEntiment Reasoner), a rule-based model, to examine the impact of COVID-19 vaccines on the attitudes of Twitter users during the pandemic.

Correlation Explanation (CorEx) provides a flexible framework for learning topics that are maximally informative about a corpus of text. Through anchor words, we seeded and guided the topic model towards topics of substantive interest, which allowed us to interact with and refine topics in a way that is not possible with traditional topic models. Normalized Topic Correlation (NTC) represents the correlations within an individual document explained by a particular topic.
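The five-level sentiment labeling described above could be sketched as a mapping from VADER's compound score (a value in [-1, 1]) to a label. This is a minimal sketch under stated assumptions: the function names `sentiment_label` and `score_tweet` are hypothetical, and the cut-offs are illustrative. The ±0.05 boundary follows the convention suggested in the vaderSentiment documentation, while ±0.5 is an arbitrary choice and not necessarily the threshold used in the study.

```python
def sentiment_label(compound: float) -> str:
    """Map a VADER compound score in [-1, 1] to one of five sentiment levels.

    The cut-offs below are illustrative assumptions, not the study's thresholds.
    """
    if compound >= 0.5:
        return "overly positive"
    if compound >= 0.05:
        return "positive"
    if compound > -0.05:
        return "neutral"
    if compound > -0.5:
        return "negative"
    return "overly negative"


def score_tweet(text: str) -> str:
    """Score one tweet with VADER and return its five-level label."""
    # Imported lazily so sentiment_label works without the package installed.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    compound = analyzer.polarity_scores(text)["compound"]
    return sentiment_label(compound)
```

Keeping the threshold mapping separate from the scoring call makes it easy to adjust the cut-offs without re-running VADER over the whole dataset.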
This study examined the conversations around COVID-19 vaccines on Twitter, using a dataset of tweets from individuals and applying machine learning and text analytics approaches. We performed exploratory data analysis using unsupervised LDA to identify initial implicit topics. The dataset was then analyzed for positive and negative sentiment. Finally, we performed topic modeling for a deeper understanding of the topics and their popularity over time.
Challenges: identifying a suitable package for tweet scraping and recognizing its extraction limits, long execution times, and runtime errors caused by memory limitations during parts of the data modeling.
This project was made in collaboration with Harsh Shah and Vivek Kumar. Do check out some of the amazing projects they've worked on.