/Analyzing-Twitter-Trends-On-COVID-19-Vaccinations

A quantitative study comprising Twitter discussions and thematic analysis for COVID-19 vaccines

Paper published at: https://infodemiology.jmir.org/2022/1/e33909


BACKGROUND

As of April 30, 2021, the COVID-19 pandemic had killed 3.2 million people and infected 150 million worldwide. Growing human rights concerns, vaccine movements, and skepticism toward the vaccines, their effects, and their efficacy have produced a multitude of conversations on social media and have complicated the vaccination process. No major studies have been conducted to analyze people’s perception of COVID-19 vaccines on social media for the year 2021.


OBJECTIVE

  • To extract information from tweets about COVID vaccines (posted between January and April 2021), where opinions are highly unstructured, heterogeneous, and positive, negative, or neutral, and to identify the factors driving changes in sentiment
  • To explore conversations and abstract "topics" that occur in the collected tweets using topic modeling and text analytics backed by breakthrough events in the timeline
  • To visualize the trends in sentiments of Twitter users and popularity associated with the discovered topics

    TOOLS

    Task | Technique | Tools/Packages Used
    Data Collection | Tweet extraction from Twitter | snscrape
    Data Pre-processing | Removed punctuation, stopwords, URLs, emojis, lemmatization | re, nltk, CountVectorizer, pandas, numpy
    Data Modeling | Unsupervised LDA | pyLDAvis.sklearn, LatentDirichletAllocation, sklearn
    Text Analytics | Topic Modeling, Sentiment analysis | vaderSentiment, corextopic
    Data Visualization | Multi-attribute plots | matplotlib, seaborn, Tableau, wordcloud
    Environments & Platforms | | MS Excel, Google Colab, Jupyter Notebook, Twitter


    DATA-COLLECTION

    Method | Notes
    Tweepy | Limited to 3,200 tweets; no historical data
    GetOldTweets3 | Twitter has removed the endpoint that GetOldTweets3 uses
    TWINT | Twitter imposes a stricter device and IP ban after a certain number of queries
    snscrape | Scraped 100K tweets, of which 96,641 were in English
    Octoparse (software) | Very time-consuming with the event loop

    Data Collection: Identifying COVID-19 Vaccines Content

  • Package used: snscrape
  • Language: English
  • Keywords: covid vaccine
  • Timeframe: January 1, 2021 to March 31, 2021
  • Number of tweets collected = 2.74 million
  • January - 884,011 tweets | February - 800,008 tweets | March - 1,127,854 tweets
  • No null values identified
  • Data Coverage:

    With "covid vaccine" as the search term, we believe that the keyword provides reasonable coverage and is representative of tweets about COVID-19 vaccines
    Individual tweets = 2.1 million
    Organizational tweets = 0.59 million

    DATA-PREPROCESSING

    Data Cleaning

  • Removed punctuation with a remove_punct function built on the re library
  • Removed URLs and emojis during tokenization using the re library
  • Removed stopwords using nltk
  • Lemmatized tweets using nltk.WordNetLemmatizer()
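The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the regular expressions, the emoji ranges, and the tiny `STOPWORDS` set are assumptions standing in for the project's own patterns and for the nltk stopword list. Lemmatization with nltk.WordNetLemmatizer() is left out because it needs the WordNet corpus to be downloaded first.

```python
import re
import string

# Assumed patterns; the notebook's exact expressions are not shown in this README.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator letters used for flags
    "]+"
)
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")

# Hypothetical stand-in for nltk's English stopword list.
STOPWORDS = {"the", "a", "an", "my", "is", "to", "and", "of", "i", "in", "for"}


def remove_punct(text):
    """Replace every ASCII punctuation character with a space."""
    return PUNCT_RE.sub(" ", text)


def clean_tweet(text):
    """Strip URLs, emojis, and punctuation, then lowercase and drop stopwords."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = remove_punct(text).lower()
    return [token for token in text.split() if token not in STOPWORDS]


tokens = clean_tweet("Got my COVID vaccine today!!! https://t.co/abc \U0001F600")
print(tokens)
```

URLs are removed before punctuation so that the link is dropped as a whole rather than being split into fragments.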

    Individual vs Organizational Tweets

  • Created a Bag-of-Words with ~175 keywords to filter on Display Names
  • Removed 22% of the data
  • 2,109,427 tweets remain after removing organizational accounts
  • Assigned week numbers (1 to 12) to the dataset
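A keyword filter on display names, as described above, might look like the sketch below. The keyword set and the `display_name` field are hypothetical: the study's ~175 keywords are not listed in this README, and the real column names may differ.

```python
import re

# Hypothetical keywords; the actual ~175-term list is not given in this README.
ORG_KEYWORDS = {"news", "health", "hospital", "university", "department", "official", "media"}


def is_organizational(display_name, keywords=ORG_KEYWORDS):
    """Return True if any whole word in the display name is an organizational keyword."""
    words = set(re.findall(r"[a-z]+", display_name.lower()))
    return not words.isdisjoint(keywords)


def keep_individuals(tweets):
    """Keep only tweets whose author display name does not look organizational."""
    return [t for t in tweets if not is_organizational(t["display_name"])]


sample = [
    {"display_name": "Daily Health News", "text": "Vaccine rollout update"},
    {"display_name": "Jane Doe", "text": "Got my first dose"},
    {"display_name": "healthy_mom", "text": "Booked my appointment"},
]
individuals = keep_individuals(sample)
print([t["display_name"] for t in individuals])
```

Matching on whole words keeps a name such as "healthy_mom" from being flagged just because it contains "health"; a substring match would be stricter and would remove more accounts.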

    DATA-MODELING

    Unsupervised LDA

    To understand the abstract topics hidden in the tweets, we applied unsupervised Latent Dirichlet Allocation (LDA) using sklearn's LatentDirichletAllocation, with the 'pyLDAvis' library for visualization. We settled on 18 topics, chosen on the basis of cluster size and the absence of overlap among the clusters.

    Sentiment Analysis

    Sentiment analysis is often framed as a supervised machine learning problem, and it comes in several variants. We used a fine-grained classification with five levels of sentiment: overly positive, positive, neutral, negative, and overly negative. We used VADER (Valence Aware Dictionary and sEntiment Reasoner), a rule-based model, to examine the impact of COVID-19 vaccines on the attitude of Twitter users during the pandemic.
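One way to turn VADER's compound score (a value between -1 and 1) into the five levels is to bin it with fixed cut-offs, as sketched below. The ±0.05 band follows the convention suggested in the vaderSentiment README for separating neutral text; the ±0.5 cut-off for the "overly" levels is purely an assumption, since the thresholds used in the study are not stated here.

```python
def sentiment_level(compound, neutral_band=0.05, strong_band=0.5):
    """Map a VADER compound score in [-1, 1] to one of five sentiment levels.

    neutral_band follows the vaderSentiment README convention;
    strong_band is a hypothetical cut-off chosen for illustration.
    """
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score must lie in [-1, 1]")
    if compound >= strong_band:
        return "overly positive"
    if compound >= neutral_band:
        return "positive"
    if compound > -neutral_band:
        return "neutral"
    if compound > -strong_band:
        return "negative"
    return "overly negative"


# With vaderSentiment installed, the score would come from:
#   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#   compound = SentimentIntensityAnalyzer().polarity_scores(text)["compound"]
for score in (0.8, 0.3, 0.0, -0.3, -0.8):
    print(score, sentiment_level(score))
```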

    CorEx

    Correlation Explanation (CorEx) provides a flexible framework for learning topics that are maximally informative about a corpus of text. Through anchor words, we seeded and guided the topic model towards topics of substantive interest, which allowed us to interact with and refine topics in a way that is not possible with traditional topic models. Normalized Topic Correlation (NTC) represents the correlations within an individual document explained by a particular topic.

    DATA-VISUALIZATION

    [Figure: Unsupervised LDA]

    [Figure: Trends in Sentiment Analysis]

    [Figure: Distribution of Sentiments]

    [Figure: Vaccine Conversation Trends]

    [Figure: Popular Topics]

    RESULTS

  • Discovered 13 unique topics from the tweets across 12 weeks from Jan’21 to Mar’21
  • February had the fewest tweets about COVID vaccinations (594,050), compared with January (695,890) and March (819,487)
  • Positive sentiment was the most common among Twitter users (732,395 tweets), followed by neutral (579,493) and negative (525,866)
  • The most discussed topics were Vaccination status, Travel, Pandemic, Politics, and Vaccine Approval
  • Topics that remained underrepresented were People Against Vaccine, Political and COVID leaders, Who Got Vaccinated
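The counts reported above can be cross-checked with a few lines of arithmetic. The monthly figures add up to 2,109,427, the number of tweets left after organizational accounts were removed. The three sentiment counts add up to less than that; treating the remainder as tweets in the two "overly" levels is an assumption, not something this README states.

```python
# Figures copied from the results list above.
monthly = {"January": 695_890, "February": 594_050, "March": 819_487}
sentiments = {"positive": 732_395, "neutral": 579_493, "negative": 525_866}

total = sum(monthly.values())
print(f"total tweets: {total:,}")

# Share of each reported sentiment class among all individual tweets.
shares = {label: count / total for label, count in sentiments.items()}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Tweets not covered by the three reported classes (assumed to be the
# overly positive and overly negative levels).
remainder = total - sum(sentiments.values())
print(f"outside the three reported classes: {remainder:,}")
```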

    CONCLUSION

    This study examined the conversations around COVID-19 vaccines on Twitter, applying machine learning and text analytics to a dataset of tweets from individual accounts. We performed exploratory data analysis using unsupervised LDA to identify initial implicit topics. The dataset was then analysed for positive and negative sentiments. Finally, we performed topic modeling for a deeper understanding of the topics and their popularity over time.


    REFERENCES

  • Praveen SV, Ittamalla R, Deepak G. Analyzing the attitude of Indian citizens towards COVID-19 vaccine - A text analytics study. Diabetes Metab Syndr. 2021 Mar-Apr;15(2):595-599. doi: 10.1016/j.dsx.2021.02.031. Epub 2021 Feb 27. PMID: 33714134; PMCID: PMC7910132
  • DeVerna, M., Pierri, F., Truong, B., Bollenbacher, J., Axelrod, D., Loynes, N., . . . Bryden, J. (2021, April 20). CoVaxxy: A collection of English-language Twitter posts about COVID-19 vaccines
  • K. Hazel Kwon, J. Hunter Priniski & Monica Chadha (2018): Disentangling User Samples: A Supervised Machine Learning Approach to Proxy-population Mismatch in Twitter Research, Communication Methods and Measures, DOI: 10.1080/19312458.2018.1430755
  • Scraping Tweets with snscrape - https://betterprogramming.pub/how-to-scrape-tweets-with-snscrape-90124ed006af
  • Vader Sentiment Analysis - https://github.com/cjhutto/vaderSentiment
  • Unsupervised LDA - https://www.kaggle.com/keitazoumana/topic-modeling-with-lda

    CHALLENGES-AND-FUTUREWORK

    Challenges: Identifying a suitable package for tweet scraping and recognizing its extraction limits; long execution times and runtime errors caused by memory limitations in parts of the data modeling

    Future Work

  • The low-impact insights from VADER sentiment analysis open up scope for a deeper dive into individual topics such as People For/Against Vaccines
  • Explore conversations and sentiments in organizational tweets
  • Incorporate the number of active COVID cases, recoveries, and deaths for the three months


    This project was made in collaboration with Harsh Shah and Vivek Kumar. Do check out some of the amazing projects they've worked on.