Twitter U.S.and India Politicians dataset

The Twitter U.S. and Indian Politicians dataset is provided for use by University of Michigan faculty, students, and their collaborators. The data is stored on Turbo at /nfs/turbo/politweets and is structured like the Twitter Decahose for U-M.

Requesting Access

UM faculty, staff, and students

Go here to apply for access. MIDAS will review your application, send you a copy of the memorandum of understanding between you and MIDAS that explains the date use terms.

The process for gaining access to the dataset is similar to the Twitter Decahose Data.

Interested folks at other institutions

Email politics-database@umich.edu with some information about your project, your timeline, and the specific data you are interested in using. We are actively developing datasets and access procedures for researchers at other institutions, but we will likely not have formal systems in place until late summer 2022.

Using the Data

CSCAR's tutorials for the Twitter Decahose are a great place to start. This data is similarly structured though it includes a different set of users.

What's in the Data

We collect tweets posted by politicians in the U.S. and India and save the JSON provided by the Twitter API. Lists of politicians are generated by NivaDuck, software developed at Microsoft Research - India for automatically identifying accounts that belong to politicians.

As of April 21, 2021, the data includes:

USA
- Number of accounts: 9608 (8994 with state metadata; all current MCs with complete metadata)
- New Accounts of Celebrities added from https://github.com/webis-de/ACL-19/tree/master/celebrity-profiling
India
- Number of political Twitter handles: 33074 (27300 with party metadata, 16027 with state metadata)
- New Handles Added from the DISMISS database

Data Collection Process

Two scripts run daily, one each for India and the U.S., to pull new tweets posted everyday by each politician in the respective lists. For India, the list of accounts includes journalists, media outlets, celebrities, and influencers. The same set for the U.S. is still being compiled and the accounts will be updated in the future.

You can view the scripts for collection in the scripts folder.

Data Structure

Data is stored in two folders, one each for India (/nfs/turbo/politweets/india) and the U.S. (/nfs/turbo/politweets/usa). Inside each, there are three folders: ~/tweets_par/, ~/logs/, ~/errors/. The ~/tweets_par/ directory contains one sub-directory for each account being tracked, named for the user's public screen_name. For example, /nfs/turbo/politweets/usa/tweets_par/realDonaldTrump/ contains tweets from former President Donald Trump. Inside every user's (politician's) directory, there are multiple compressed JSON files.

Twitter User Metadata

We are manually checking all accounts NivaDuck identified and will provide periodic metadata updates.

Metadata Fields

See the codebook for a list of metadata fields, descriptions, variable types, valid values, etc.

Here's an example of the minimum metadata:

	id	id_str	screen_name	confirmed_account_type	state	twitter_name	real_name	bioguide	office_holder	party	district	level	woman	birthday	last_updated
0	986781648	986781648	jeffsessions	1	Alabama	Jeff Sessions									4/20/21
29	1155335864	1155335864	repdonaldpayne	1	New Jersey	Rep. Donald Payne Jr	Donald Payne	P000604	1	1	10	3	FALSE	12/17/58	4/20/21
74	2970462034	2970462034	repkathleenrice	1	New York	Kathleen Rice	Kathleen Rice	R000602	1	1	4	3	TRUE	2/15/65	4/20/21

Archived metadata files are available in the metadata folder as well.

Contributors

Anmol Panda and Armand Burks wrote the scripts to collect and archive Tweets using the Twitter Public API (via tweepy). Libby Hemphill generated this documentation and manages the team who collect and update data and metadata. Evan Parres handles metadata updates, and Najmin Ahmed manually verified many state labels for 2020 election candidates.

This project is a continuation of work initiated by Joyojeet Pal and Anmol Panda at Microsoft Research India.

Funding for the staff and infrastructure were provided by

We are grateful to Ballot Ready for providing data on political candidates in the U.S.

How to Cite/Acknowledge the Data

Acknowledge the Dataset Providers

If you use the data, please cite the NivaDuck paper (bibtex below) and include the following acknowledgment:

The Twitter U.S.and India Politicians dataset is supported by the Michigan Institute for Data Science, Advanced Research Computing - Technology Services, and Consulting for Statistics, Computing & Analytics Research, and Dr. Libby Hemphill's research group in the School of Information.

Cite the NivaDuck paper

BibTeX

@inproceedings{
10.1145/3400806.3400830,
author = {Panda, Anmol and Gonawela, A’ndre and Acharyya, Sreangsu and Mishra, Dibyendu and Mohapatra, Mugdha and Chandrasekaran, Ramgopal and Pal, Joyojeet},
title = {NivaDuck - A Scalable Pipeline to Build a Database of Political Twitter Handles for India and the United States},
year = {2020},
isbn = {9781450376884},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3400806.3400830},
doi = {10.1145/3400806.3400830},
abstract = {We present a scalable methodology to identify Twitter handles of politicians in a given region and test our framework in the context of Indian and US politics. The main contribution of our work is the list of the curated Twitter handles of 18500 Indian and 8000 US politicians. Our work leveraged machine learning-based classification and human verification to build a data set of Indian politicians on Twitter. We built NivaDuck, a highly precise, two-staged classification pipeline that leverages Twitter description text and tweet content to identify politicians. For India, we tested NivaDuck’s recall using Twitter handles of the members of the Indian parliament while for the US we used state and local level politicians in California state and San Diego county respectively. We found that while NivaDuck has lower recall scores, it produces large, diverse sets of politicians with precision exceeding 90 percent for the US dataset. We discuss the need for an ML-based, scalable method to compile such a dataset and its myriad use cases for the research community and its wide-ranging utilities for research in political communication on social media. },
booktitle = {International Conference on Social Media and Society},
pages = {200–209},
numpages = {10},
keywords = {united states, india, archive, twitter, politics},
location = {Toronto, ON, Canada},
series = {SMSociety'20}
}

APA 7th

Panda, A., Gonawela, A., Acharyya, S., Mishra, D., Mohapatra, M., Chandrasekaran, R., & Pal, J. (2020). NivaDuck - A Scalable Pipeline to Build a Database of Political Twitter Handles for India and the United States. International Conference on Social Media and Society, 200–209. https://doi.org/10.1145/3400806.3400830

Reporting Issues and Getting Help

Use issues to report bugs and request changes to the collection process or metadata. We will not be providing hands-on help with the data, but we will try to answer questions if they come up. You can direct questions to politics-database@umich.edu.

casmlab/politicians-tweets