COVID-19-Arabic-Tweets-Dataset

The repository contains a collection of Arabic tweet IDs associated with the novel coronavirus (COVID-19). The dataset contains tweet IDs from 2020-01-01 to 2020-04-30. The Twitter search API was used to gather real-time tweets that contained specific keywords in the Arabic language. The dataset contains almost four and a half million Arabic tweets.

The repository contains a collection of Arabic tweet IDs related to the novel coronavirus (COVID-19). The dataset contains tweet IDs starting from January 2020. The Twitter search API was used to gather real-time tweets that contained specific keywords in the Arabic language. To comply with Twitter’s Terms of Service, only the IDs of the tweets are released. This dataset is for non-commercial research use only.

Data Organization

  • As of Jan 26, 2021, we have tweets from January 2020 until May 30, 2020. We plan to add more months in the coming days and to update this page continuously.
  • Tweet-ID files are stored in folders that indicate the year and month of collection.
  • The Tweet-ID files contain the tweet IDs. All file names share the same structure, with the prefix “COVID19-tweetID-year-month-day”.
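The naming scheme above can be sketched in Python as follows. This is only an illustration: the README does not state whether month and day are zero-padded or which file extension is used, so the zero-padding here is an assumption and no extension is appended. The function names are hypothetical and not part of the repository.

```python
from datetime import date, timedelta
from typing import List

# Prefix taken from the README's description of the file names.
PREFIX = "COVID19-tweetID"


def tweet_id_filename(day: date) -> str:
    """Build a Tweet-ID file name for one collection day.

    Assumption: four-digit year and zero-padded month and day.
    The file extension is not documented, so none is added.
    """
    return f"{PREFIX}-{day.year:04d}-{day.month:02d}-{day.day:02d}"


def filenames_between(start: date, end: date) -> List[str]:
    """List the expected file names for every day from start to end, inclusive."""
    n_days = (end - start).days + 1
    return [tweet_id_filename(start + timedelta(days=i)) for i in range(n_days)]


if __name__ == "__main__":
    # Print the expected names for the first three days of the collection.
    for name in filenames_between(date(2020, 1, 1), date(2020, 1, 3)):
        print(name)
```

Compare the generated names against the actual files in the repository before relying on them, and adjust the padding or add an extension if they differ.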

Dataset collection

  • Only tweets in the Arabic language were collected, from January 1, 2020 to May 30, 2020.
  • The keywords.txt file contains the updated keywords along with the date we began tracking them. The Hashtags.txt file contains the hashtags that we followed in our Twitter dataset and the number of tweets collected for each hashtag, along with the date we began tracking them.
  • Since Twitter’s search API restricts the amount of data that can be retrieved, some hours of data are missing.
  • We provide preliminary statistics of the dataset in the paper associated with this repository. The preliminary statistics will be automatically updated with every update of the dataset.
  • To retrieve the full tweet objects, consider the following tools: Hydrator and twarc.

Dataset Statistics

The following statistics are from tweets collected until May 30, 2020.
Number of tweets: 6,086,085
Number of tweets with geolocation: 3,925
Average number of tweets collected daily: 40,573

Guideline to Hydrate

Using TWARC Notebook

To hydrate the tweet IDs from our COVID-19-Arabic-Tweets-Dataset GitHub repository, you can use our Hydrate_TweetIDs_Arabic_COVID19 notebook.

  • The notebook runs on Google Colab.
  • You are required to have a Twitter developer account.
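For readers who want to see roughly what hydration with twarc involves outside the notebook, here is a minimal sketch. It is not the code from the Hydrate_TweetIDs_Arabic_COVID19 notebook. It assumes that each Tweet-ID file holds one numeric ID per line, that the twarc package is installed, and that your developer credentials still grant access to the API that twarc's v1 `Twarc` client calls. The helper names `read_tweet_ids` and `hydrate` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def read_tweet_ids(lines: Iterable[str]) -> List[str]:
    """Return unique tweet IDs in first-seen order.

    Assumption: one numeric ID per line. Blank or non-numeric
    lines are skipped.
    """
    seen = set()
    ids = []
    for line in lines:
        tweet_id = line.strip()
        if tweet_id.isdigit() and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


def hydrate(ids: Iterable[str], consumer_key: str, consumer_secret: str,
            access_token: str, access_token_secret: str) -> Iterator[Dict]:
    """Yield full tweet objects for the given IDs using twarc's v1 client.

    The import is local so that reading IDs works without twarc installed.
    Deleted or protected tweets are simply not returned.
    """
    from twarc import Twarc  # third-party: pip install twarc

    client = Twarc(consumer_key, consumer_secret,
                   access_token, access_token_secret)
    yield from client.hydrate(ids)


if __name__ == "__main__":
    # Demonstrates ID cleaning only; no network access is needed here.
    sample = ["1212345678901234567\n", "\n", "1212345678901234567\n"]
    print(read_tweet_ids(sample))
```

To hydrate a real file, open it, pass the handle to `read_tweet_ids`, and then pass the result to `hydrate` together with the four credentials from your Twitter developer account.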

For those who prefer a graphical user interface (GUI), we suggest using Hydrator.

Using Hydrator

To use Hydrator, follow the instructions in the Hydrator GitHub repository.

For a guide in Arabic to both Hydrator and our twarc notebook, see our دليل استعادة قاعدة بيانات التغريدات (Guide to Hydrating the Tweets Dataset).

Licensing

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to the terms of the LICENSE and to all of Twitter’s Terms of Service, and you agree to cite our paper: https://arxiv.org/abs/2004.04315

Contact

If you have any suggestions or questions, please reach out to saraa.alqurashi on Gmail or eaanazi(AT)uqu(dot)edu(dot)sa