
A list of Twitter datasets and related resources.


If you have a resource to add to the list, feel free to open a pull request, or email me at shay.palachy@gmail.com.

The license, when known, is given in {curly brackets}. Dataset size is given in [square brackets] when available.

1   Twitter Datasets

1.1   Tweet datasets

1.1.1   Tweet ID datasets

1.2   Tweets datasets (labelled)

  • Sentiment140 - Automatically labelled; the authors assume that any tweet with a positive emoticon, like :), is positive, and that any tweet with a negative emoticon, like :(, is negative.
  • Weather-sentiment
  • Crowdflower Gender Classifier Data [20k] - Contributors were asked to simply view a Twitter profile and judge whether the user was a male, a female, or a brand (non-individual). The dataset contains 20,000 rows, each with a user name, a random tweet, account profile and image, location, and even link and sidebar color.
  • Sanders Analytics {?} [5k] - Use the Internet Archive's Wayback Machine to get the data. The dataset consists of 5,513 hand-classified tweets. Each tweet was classified with respect to one of four different topics.
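
The emoticon-based labelling described for Sentiment140 above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the emoticon sets and the `label_tweet` function are assumptions made for this example, and treating mixed or absent signals as unlabelled is a design choice of the sketch.

```python
# Hypothetical emoticon sets; the lists used to build Sentiment140 may differ.
POSITIVE_EMOTICONS = {":)", ":-)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":-("}


def label_tweet(text):
    """Return 'positive', 'negative', or None if the signal is absent or mixed."""
    has_pos = any(e in text for e in POSITIVE_EMOTICONS)
    has_neg = any(e in text for e in NEGATIVE_EMOTICONS)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    # No emoticon, or both kinds present: leave the tweet unlabelled.
    return None
```

Labels produced this way are noisy, since an emoticon does not always match the sentiment of the text (sarcasm, for instance), so they are best treated as weak supervision rather than ground truth.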

1.3   User datasets

1.4   Lost Datasets

2   Other Lists

3   Tools

3.1   Data Collection

3.2   Analysis

4   Academic Papers

5   Articles & blog posts