COVID-19-Tweet-Classification-Challenge-by-ZindiWeekendz

ML-driven sentiment analysis is an important tool for understanding communities’ feelings about major issues such as COVID-19. However, a dataset collected only by matching keywords such as ‘coronavirus’ or ‘covid’ is incomplete, because it misses tweets that discuss the topic without using those words.

The objective of this challenge is to develop a machine learning model that determines whether a Twitter post is about COVID-19. Such a model helps gather tweets about the epidemic without relying only on keywords like ‘covid’ or ‘coronavirus’ being present, allowing researchers and engineers to build a more comprehensive dataset for sentiment analysis.
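As a rough illustration of this kind of model, the sketch below trains a tiny bag-of-words Naive Bayes classifier using only the Python standard library. It is a baseline under stated assumptions, not the challenge's reference solution: the class name `NaiveBayesTweetClassifier`, the `tokenize` helper, and the six training sentences are all invented for illustration and do not come from the challenge data. Labels follow the challenge convention of 1 for COVID-19-related and 0 for not related.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTweetClassifier:
    """Hypothetical multinomial Naive Bayes baseline with add-one smoothing."""

    def fit(self, texts, labels):
        # Number of training tweets per class, used for the class prior.
        self.class_counts = Counter(labels)
        # Per-class word frequencies.
        self.word_counts = {c: Counter() for c in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        # Vocabulary is the union of all words seen in training.
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for c, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[c].values())
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokenize(text):
                if tok not in self.vocab:
                    continue  # ignore words never seen in training
                # Add-one (Laplace) smoothed log likelihood.
                score += math.log(
                    (self.word_counts[c][tok] + 1) / (total_words + len(self.vocab))
                )
            scores[c] = score
        return max(scores, key=scores.get)


# Invented toy examples; note that none contains an explicit COVID keyword.
train_texts = [
    "lockdown extended again, stay home and wash your hands",
    "hospitals are running out of ventilators and masks",
    "quarantine day ten and still working from home",
    "great goal in the match last night",
    "new phone arrived today and the camera is great",
    "traffic was terrible on the way to the beach",
]
train_labels = [1, 1, 1, 0, 0, 0]

clf = NaiveBayesTweetClassifier().fit(train_texts, train_labels)
print(clf.predict("wear masks and stay home during lockdown"))
print(clf.predict("the match was great"))
```

The point of the toy data is that context words such as "lockdown" or "masks" carry the signal once the obvious keywords are gone. A competitive entry would likely use a stronger text model, but a baseline like this gives a reference score to beat.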

This model could be put into practice as part of a larger effort to understand online sentiment around COVID-19, and inform future communications and public interventions by governments and non-government public health organisations.

The data used for this challenge was collected by the Zindi team via the Twitter API from tweets posted over the past year. There are ~7,000 tweets in the train set and ~3,000 in the test set.

Tweets have been classified as COVID-19-related (1) or not COVID-19-related (0). All tweets have had the following keywords removed:

corona, coronavirus, covid, covid19, covid-19, sarscov2, 19

The tweets have also had usernames and web addresses removed to ensure anonymity.
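A model trained on this data will likely work best if new tweets are scrubbed the same way before prediction. The sketch below shows one way that could be done with regular expressions. The exact rules the Zindi team applied are not documented here, so the function name `scrub_tweet`, the whole-token matching of keywords, and the URL and username patterns are all assumptions.

```python
import re

# Keywords listed above as removed from every tweet.
REMOVED_KEYWORDS = [
    "corona", "coronavirus", "covid", "covid19", "covid-19", "sarscov2", "19",
]

# Assumption: keywords are removed as whole tokens, case-insensitively.
# Longer keywords come first so "coronavirus" wins over "corona".
_KEYWORD_RE = re.compile(
    r"(?<!\w)(?:"
    + "|".join(
        re.escape(k) for k in sorted(REMOVED_KEYWORDS, key=len, reverse=True)
    )
    + r")(?!\w)",
    re.IGNORECASE,
)
# Assumption: simple patterns for web addresses and @usernames.
_URL_RE = re.compile(r"https?://\S+|www\.\S+")
_USER_RE = re.compile(r"@\w+")


def scrub_tweet(text):
    """Remove URLs, usernames, and the listed keywords, then tidy whitespace."""
    text = _URL_RE.sub(" ", text)
    text = _USER_RE.sub(" ", text)
    text = _KEYWORD_RE.sub(" ", text)
    return " ".join(text.split())


print(scrub_tweet("@who says covid-19 cases rose, see https://t.co/abc #COVID19"))
```

Because of the whole-token assumption, a string such as "2019" is left intact even though "19" is on the keyword list. If the original cleaning removed substrings instead, the lookaround guards in `_KEYWORD_RE` would need to be dropped.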