County Tweet Lexical Bank

County-level word and topic loadings derived from a 10% Twitter sample from 2009-2015. Anonymized linguistic features were extracted from over 1.5 billion English tweets mapped to U.S. counties.

Read the full publication here.

Data

Available both in CSV format and as a MySQL dump. All tables are in sparse (long) format.

Unigrams

Approximately the 24,000 most frequent unigrams. All URLs are replaced with <URL> and all @-mentions are replaced with <USER>.

group_id: County FIPS code
feat: unigram
value: Number of times the unigram was used by the county
group_norm: Average number of times the feature was used by the county (value / number of users in county)
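
The sparse (long) layout can be read into a per-county lookup with a few lines of Python. This is a minimal sketch, assuming the CSV files carry a header row with the four column names listed above; the sample rows are invented for illustration and are not taken from the released data, and `load_sparse` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Toy rows in the sparse (long) layout described above. The values are
# invented for illustration; the header row is an assumption.
SAMPLE_CSV = """group_id,feat,value,group_norm
01001,hello,12,0.004
01001,<URL>,30,0.010
42101,hello,7,0.002
"""


def load_sparse(handle):
    """Return {county FIPS: {feat: group_norm}} from long-format rows."""
    table = defaultdict(dict)
    for row in csv.DictReader(handle):
        # Keep group_id as a string so leading zeros in FIPS codes survive.
        table[row["group_id"]][row["feat"]] = float(row["group_norm"])
    return dict(table)


counties = load_sparse(io.StringIO(SAMPLE_CSV))

# A feature with no row for a county is implicitly zero in a sparse table.
hello_42101 = counties["42101"].get("hello", 0.0)
url_42101 = counties["42101"].get("<URL>", 0.0)
```

Reading `group_id` as text rather than as an integer matters because FIPS codes for some states begin with a zero. To read a downloaded file, replace the `io.StringIO` object with an opened file handle.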

Facebook Topics

Topic loadings per county using a set of 2000 topics derived via Latent Dirichlet Allocation (LDA) from over 14 million Facebook status updates (see full details on topic derivation). Topics, words per topic, and conditional probabilities are available here.

group_id: County FIPS code
feat: Topic id
value: Number of times a word in the topic was used by the county
group_norm: Relative frequency of topic use by county

Data Processing

Twitter data was processed using the following rules:

Each tweet was mapped to a U.S. county using tweet-level latitude/longitude information and user-level profile free text (full details here).
Tweets were filtered for English using langid.
Users with fewer than 30 tweets were removed.
Counties with fewer than 100 users were removed.
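
The two thresholds above can be sketched as a pair of filters. This is an illustration only, not the project's pipeline: it assumes the input is one `(user_id, county_fips)` pair per mapped English tweet and that each user belongs to a single county, which the rules above do not specify. The function and constant names are hypothetical.

```python
from collections import Counter, defaultdict

MIN_TWEETS_PER_USER = 30
MIN_USERS_PER_COUNTY = 100


def filter_users_and_counties(tweets,
                              min_tweets=MIN_TWEETS_PER_USER,
                              min_users=MIN_USERS_PER_COUNTY):
    """Apply the user and county thresholds.

    tweets: iterable of (user_id, county_fips) pairs, one per tweet.
    Returns {county_fips: set of retained user_ids}.
    """
    tweets = list(tweets)
    tweets_per_user = Counter(user for user, _ in tweets)

    # Step 1: drop users with fewer than min_tweets tweets.
    users_by_county = defaultdict(set)
    for user, county in tweets:
        if tweets_per_user[user] >= min_tweets:
            users_by_county[county].add(user)

    # Step 2: drop counties with fewer than min_users remaining users.
    return {county: users
            for county, users in users_by_county.items()
            if len(users) >= min_users}


# Small thresholds so the toy input is easy to follow.
toy = [("u1", "A")] * 3 + [("u2", "A")] + [("u3", "B")] * 3
kept = filter_users_and_counties(toy, min_tweets=2, min_users=1)
```

In the toy input, user `u2` has a single tweet and is dropped, so county `A` retains only `u1`.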

Linguistic features were then derived as follows:

Unigram relative frequencies were extracted for each user.
User-level relative frequencies were averaged to the county level.
Topic loadings were calculated using county-level unigram relative frequencies.
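
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: tokenization is taken as given, and the topic loading is assumed to be the sum over words of p(topic | word) multiplied by the county-level relative frequency of that word. That formula is a common way to apply LDA conditional probabilities, not one spelled out in this document, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def user_relative_frequencies(tokens):
    """Relative frequency of each unigram within one user's tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def county_average(user_freqs):
    """Average user-level relative frequencies over a county's users.

    A user who never used a word contributes zero for that word.
    """
    sums = defaultdict(float)
    for freqs in user_freqs:
        for word, freq in freqs.items():
            sums[word] += freq
    n_users = len(user_freqs)
    return {word: total / n_users for word, total in sums.items()}


def topic_loadings(county_freqs, topic_given_word):
    """Assumed formula: loading(t) = sum_w p(t | w) * relfreq(w, county).

    topic_given_word: {topic_id: {word: p(topic | word)}}
    """
    return {
        topic: sum(p * county_freqs.get(word, 0.0) for word, p in weights.items())
        for topic, weights in topic_given_word.items()
    }


# Toy county with two users; the tokens and probabilities are invented.
users = [["good", "day", "good", "day"], ["good", "night"]]
county = county_average([user_relative_frequencies(t) for t in users])
loadings = topic_loadings(county, {"t1": {"good": 0.5, "day": 1.0}})
```

Averaging per-user frequencies gives every retained user equal weight, so a handful of very prolific accounts cannot dominate a county's profile the way they would if all tweets were pooled before counting.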

Citation

Please cite the following paper if you use this data.

@inproceedings{giorgi2018remarkable,
    title={The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-Level Predictions}, 
    author={Giorgi, Salvatore and Preotiuc-Pietro, Daniel and Buffone, Anneke and Rieman, Daniel and Ungar, Lyle H. and Schwartz, H. Andrew}, 
    booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing}, 
    year={2018}
}

License

Licensed under the GNU General Public License v3 (GPLv3).

Background

Developed by the World Well-Being Project based out of the University of Pennsylvania and Stony Brook University.

hschwartz/county_tweet_lexical_bank