Analyzing-Twitter-Trends-On-COVID-19-Vaccinations

A quantitative study of Twitter discussions about COVID-19 vaccines, combining sentiment and thematic analysis
Paper published at: https://infodemiology.jmir.org/2022/1/e33909

Background
Objective
Tools and Packages
Data Collection
Data Pre-Processing
Data Modeling
Data Visualization
Results
Conclusion
References
Challenges and Future Work

BACKGROUND

The COVID-19 pandemic had killed 3.2 million people and infected 150 million around the world as of April 30, 2021. Growing human rights concerns, vaccine movements, and skepticism towards the vaccines, their effects, and their efficacy have produced a multitude of conversations on social media and have made the vaccination process a complicated task. No major studies have been conducted to analyze people’s perception of COVID-19 vaccines on social media for the year 2021.

OBJECTIVE

To extract information from tweets (between January and April 2021) related to COVID vaccines, where opinions are highly unstructured, heterogeneous, and positive, negative, or neutral, and to identify the driving factors behind changes in sentiment

To explore conversations and abstract "topics" that occur in the collected tweets using topic modeling and text analytics backed by breakthrough events in the timeline

To visualize the trends in sentiments of Twitter users and popularity associated with the discovered topics

TOOLS

Task	Technique	Tools/Packages Used
Data Collection	Tweet extraction from Twitter	snscrape
Data Pre-processing	Punctuation, stopword, URL, and emoji removal; lemmatization	re, nltk, CountVectorizer, pandas, numpy
Data Modeling	Unsupervised LDA	pyLDAvis.sklearn, LatentDirichletAllocation, sklearn
Text Analytics	Topic Modeling, Sentiment analysis	vaderSentiment, corextopic
Data Visualization	Multi-attribute plots	matplotlib, seaborn, Tableau, wordcloud
Environments & Platforms		MS Excel, Google Colab, Jupyter Notebook, Twitter

DATA-COLLECTION

Method	Notes
Tweepy	3200 tweets; no historical data
GetOldTweets3	Twitter has removed the endpoint that GetOldTweets3 uses
TWINT	Twitter imposes a stricter device and IP ban after a certain number of queries
snscrape	Scraped 100K tweets, of which 96,641 were English tweets
Octoparse (software)	Very time consuming with the event loop

Data Collection: Identifying COVID-19 Vaccines Content

Package used: snscrape

Language: English

Keywords: covid vaccine

Timeframe: January 1, 2021 to March 31, 2021

Number of tweets collected = 2.74 million

January - 884,011 tweets | February - 800,008 tweets | March - 1,127,854 tweets

No null values identified
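
The collection settings above can be sketched as follows. This is a minimal illustration, not the study's actual script: the helper names build_query and collect are hypothetical, the since:/until:/lang: operators are standard Twitter search syntax, and the tweet attribute names are assumed to match those exposed by snscrape's Twitter module, which differ between releases.

```python
from datetime import date
from itertools import islice


def build_query(keywords: str, start: date, end_exclusive: date, lang: str = "en") -> str:
    """Compose a Twitter search string with the since:/until:/lang: operators."""
    return (
        f"{keywords} lang:{lang} "
        f"since:{start.isoformat()} until:{end_exclusive.isoformat()}"
    )


def collect(query: str, limit: int) -> list:
    """Pull up to `limit` tweets for `query` (needs snscrape and network access)."""
    # Imported lazily so build_query works without snscrape installed.
    import snscrape.modules.twitter as sntwitter

    rows = []
    scraper = sntwitter.TwitterSearchScraper(query)
    for tweet in islice(scraper.get_items(), limit):
        # Assumption: newer releases expose `rawContent`, older ones `content`.
        text = getattr(tweet, "rawContent", None) or getattr(tweet, "content", "")
        rows.append(
            {"date": tweet.date, "display_name": tweet.user.displayname, "text": text}
        )
    return rows


# `until:` is treated as exclusive here, so April 1 covers tweets through March 31.
QUERY = build_query("covid vaccine", date(2021, 1, 1), date(2021, 4, 1))
```

Calling collect(QUERY, 1000) would return a list of dictionaries that can be loaded into a pandas DataFrame; the display name is kept because the later individual/organizational split filters on it.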

Data Coverage:

Using covid vaccine as the search term, we believe that our keywords provide reasonable coverage and are representative of tweets communicating about COVID-19 vaccines
Individual tweets = 2.1 million
Organizational tweets = 0.59 million

DATA-PREPROCESSING

Data Cleaning

Removed punctuation using remove_punct function with library re

Removed URLs and emojis in Tokenization using library re

Removed stopwords using nltk

Lemmatization of Tweets using nltk.WordNetLemmatizer()
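
A rough sketch of these cleaning steps is shown below. It is not the project's remove_punct implementation: the emoji pattern is an approximation, the stopword set is a tiny stand-in for the nltk English list, and clean_tweet is a hypothetical helper name. Lemmatization is passed in as a callable so the example runs without downloading nltk corpora.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji/pictograph ranges; an approximation, not an exhaustive list.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF]")
PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")

# Stand-in stopword set (assumption); the nltk English stopword list
# would take its place in a real run.
STOPWORDS = {"a", "an", "and", "are", "for", "i", "in", "is", "my", "of", "on", "the", "to"}


def clean_tweet(text, stopwords=STOPWORDS, lemmatize=None):
    """Lowercase, strip URLs/emojis/punctuation, drop stopwords, optionally lemmatize."""
    text = URL_RE.sub(" ", text.lower())   # URLs first, before punctuation is split off
    text = EMOJI_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in stopwords]
    if lemmatize is not None:
        tokens = [lemmatize(tok) for tok in tokens]
    return tokens


tokens = clean_tweet("Got my 2nd dose today!! 💉 https://t.co/abc #CovidVaccine")
```

To mirror the nltk-based steps, the lemmatize method of an nltk WordNetLemmatizer instance can be passed as the lemmatize argument once the WordNet corpus is available locally.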

Individual vs Organizational Tweets

Created a Bag-of-Words with ~175 keywords to filter on Display Names

Removed 22% of the data

2,109,427 tweets remain after removing organizational accounts

Assigned week numbers (1 to 12) to the dataset
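
The display-name filter can be illustrated with a short sketch. The keywords below are hypothetical examples, since the actual list of ~175 keywords is not reproduced in this README, and is_organization and split_accounts are illustrative names. Matching is done on whole words so that a name like "Newsome" does not trigger the keyword "news".

```python
import re

# Hypothetical examples only; the study's ~175-keyword list is not shown here.
ORG_KEYWORDS = {"news", "hospital", "university", "department", "official", "clinic"}


def is_organization(display_name, keywords=ORG_KEYWORDS):
    """Return True if any whole word of the display name is an organizational keyword."""
    words = re.findall(r"[a-z0-9]+", display_name.lower())
    return any(word in keywords for word in words)


def split_accounts(tweets):
    """Partition tweet records into (individuals, organizations) by display name."""
    individuals, organizations = [], []
    for tweet in tweets:
        target = organizations if is_organization(tweet["display_name"]) else individuals
        target.append(tweet)
    return individuals, organizations


sample = [
    {"display_name": "Jane Doe", "text": "first dose done"},
    {"display_name": "City Hospital News", "text": "clinic opens monday"},
    {"display_name": "Newsome Family", "text": "grandma got her shot"},
]
individuals, organizations = split_accounts(sample)
```

A keyword approach like this trades precision for speed: it scales to millions of rows but will misclassify accounts whose names happen to contain a listed word.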

DATA-MODELING

Unsupervised LDA

To understand the abstract topics hidden in the tweets, we applied unsupervised Latent Dirichlet Allocation (LDA) using sklearn's LatentDirichletAllocation and explored the result with pyLDAvis. We settled on 18 topics, judging by cluster size and the absence of overlap among the clusters

Sentiment Analysis

Sentiment analysis is often framed as a supervised machine learning problem, and it comes in several variants. We used a fine-grained classification with five levels of sentiment: overly positive, positive, neutral, negative, and overly negative. Rather than training a classifier, we used VADER (Valence Aware Dictionary and sEntiment Reasoner), a rule-based model, to examine the impact of COVID-19 vaccines on the attitude of Twitter users during the pandemic.
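
One way to obtain five levels from VADER is to bucket its compound score, which ranges from -1 to 1. The sketch below is an assumption about how that could be done, not the study's documented rule: the ±0.05 cut-offs follow the convention suggested in the VADER documentation, while the ±0.5 cut-offs for the "overly" classes are hypothetical.

```python
def bucket(compound):
    """Map a VADER compound score in [-1, 1] to one of five sentiment labels."""
    if compound <= -0.5:       # hypothetical cut-off for the strongest negatives
        return "overly negative"
    if compound <= -0.05:      # conventional VADER negative threshold
        return "negative"
    if compound < 0.05:
        return "neutral"
    if compound < 0.5:         # hypothetical cut-off for the strongest positives
        return "positive"
    return "overly positive"


def classify_tweets(texts):
    """Score each text with VADER and return (text, compound, label) tuples."""
    # Imported lazily so bucket() can be used without vaderSentiment installed.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    results = []
    for text in texts:
        compound = analyzer.polarity_scores(text)["compound"]
        results.append((text, compound, bucket(compound)))
    return results
```

VADER is usually run on lightly processed text, because capitalization, punctuation, and emojis feed into its rules; whether scoring happened before or after cleaning is not stated here.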

CorEx

Correlation Explanation (CorEx) provides a flexible framework for learning topics that are maximally informative about a corpus of text. Through anchor words, we seeded and guided the topic model towards topics of substantive interest, which allowed us to interact with and refine topics in a way that is not possible with traditional topic models. Normalized Topic Correlation (NTC) represents the correlations within an individual document explained by a particular topic.
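
Anchoring can be sketched with the corextopic package as shown below. The function names, the anchor_strength value, and the example anchor words are assumptions for illustration; the anchors actually used in the study are not listed in this README. CorEx works on binary document-word matrices, hence binary=True in the vectorizer.

```python
def keep_known_anchors(anchors, vocabulary):
    """Drop anchor words missing from the vocabulary, then drop empty groups."""
    vocab = set(vocabulary)
    kept = [[word for word in group if word in vocab] for group in anchors]
    return [group for group in kept if group]


def fit_anchored_corex(docs, anchors, n_topics, anchor_strength=3, seed=0):
    """Fit an anchored CorEx topic model (needs scikit-learn and corextopic)."""
    # Imported lazily so the anchor helper works without these packages.
    from corextopic import corextopic as ct
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(binary=True)   # presence/absence of each word
    X = vectorizer.fit_transform(docs)
    words = list(vectorizer.get_feature_names_out())

    model = ct.Corex(n_hidden=n_topics, seed=seed)
    model.fit(
        X,
        words=words,
        anchors=keep_known_anchors(anchors, words),
        anchor_strength=anchor_strength,  # values above 1 pull topics toward anchors
    )
    return model


# Hypothetical anchor groups, one list per seeded topic.
EXAMPLE_ANCHORS = [["travel", "passport"], ["approval", "fda"]]
```

After fitting, the model's get_topics method lists the top words per topic, which makes it straightforward to check whether each anchored topic has settled on the intended theme and to adjust the anchors if not.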

DATA-VISUALIZATION

Unsupervised LDA

Trends in Sentiment Analysis

Distribution of Sentiments

Vaccine Conversation Trends

RESULTS

Discovered 13 unique topics from the tweets across 12 weeks from Jan’21 to Mar’21

February had the lowest number of tweets (594,050) as compared to January (695,890) and March (819,487) about COVID vaccinations

Positive sentiment contributed the most in overall sentiments of Twitter users (732,395), followed by neutral (579,493) and negative (525,866) sentiments

The most discussed topics were Vaccination status, Travel, Pandemic, Politics, and Vaccine Approval

Topics that remained underrepresented were People Against Vaccine, Political and COVID leaders, Who Got Vaccinated

CONCLUSION

This study examined the conversations around COVID-19 vaccines on Twitter, using a dataset of tweets from individuals and applying Machine Learning and Text Analytics approaches. We performed exploratory data analysis using Unsupervised LDA to identify initial implicit topics. The dataset was further analysed for positive and negative sentiments. We then performed topic modeling for a deeper understanding of the topics and their popularity over time.

REFERENCES

Praveen SV, Ittamalla R, Deepak G. Analyzing the attitude of Indian citizens towards COVID-19 vaccine - A text analytics study. Diabetes Metab Syndr. 2021 Mar-Apr;15(2):595-599. doi: 10.1016/j.dsx.2021.02.031. Epub 2021 Feb 27. PMID: 33714134; PMCID: PMC7910132

DeVerna, M., Pierri, F., Truong, B., Bollenbacher, J., Axelrod, D., Loynes, N., . . . Bryden, J. (2021, April 20). CoVaxxy: A collection of English-language Twitter posts about COVID-19 vaccines

K. Hazel Kwon, J. Hunter Priniski & Monica Chadha (2018): Disentangling User Samples: A Supervised Machine Learning Approach to Proxy-population Mismatch in Twitter Research, Communication Methods and Measures, DOI: 10.1080/19312458.2018.1430755

Scraping Tweets with snscrape - https://betterprogramming.pub/how-to-scrape-tweets-with-snscrape-90124ed006af

Vader Sentiment Analysis - https://github.com/cjhutto/vaderSentiment

Unsupervised LDA - https://www.kaggle.com/keitazoumana/topic-modeling-with-lda

CHALLENGES-AND-FUTUREWORK

Challenges: Identifying a suitable package for tweet scraping and recognizing its extraction limits; long execution times and runtime errors caused by memory limitations in parts of the data modeling

Future Work

The low-impact insights from VADER Sentiment Analysis leave scope for a deeper, topic-by-topic study, for example of People For/Against vaccines

Explore conversations and sentiments in organizational tweets

Relate the findings to the number of active COVID cases, recoveries and deaths for the three months

This project was made in collaboration with Harsh Shah and Vivek Kumar. Do check out some of the amazing projects they've worked on.

rashidesai24/Analyzing-Twitter-Trends-On-COVID-19-Vaccinations
