A set of 2,000 topics derived via Latent Dirichlet Allocation (LDA) from over 14 million Facebook status updates.
- Top 20 words per topic
- All words
- Conditional probabilities (sparse matrix format)
The goal is to get the probability of a topic given the document:
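Judging from the worked example that follows, this amounts to a sum over the words of the document. The expression below is a reconstruction from that example, not a formula quoted from the original text. It assumes the per-word values listed for each topic are p(topic | word) and that p(word | document) is the word's relative frequency in the document:

```latex
p(\text{topic} \mid \text{document})
  = \sum_{\text{word}} p(\text{topic} \mid \text{word}) \, p(\text{word} \mid \text{document}),
\qquad
p(\text{word} \mid \text{document})
  = \frac{\operatorname{count}(\text{word}, \text{document})}{\sum_{w} \operatorname{count}(w, \text{document})}
```

Words that do not appear in a topic contribute nothing to that topic's sum.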
For example, let's say we have topics with the following conditional probabilities for words:
topic 1: a: 0.01, b: 0.02, c: 0.001
topic 2: c: 0.02, d: 0.005
and two documents with the following frequencies of words:
document 1: a: 2, b: 10, c: 3, d: 0, e: 6, f: 4
document 2: a: 5, b: 3, c: 8, d: 4, e: 0, f: 10
therefore the total word use in each document is:
document 1: 2 + 10 + 3 + 0 + 6 + 4 = 25
document 2: 5 + 3 + 8 + 4 + 0 + 10 = 30
document 1's use of topics is given by summing the weighted relative frequencies:
p(topic1|document1): (2/25)*0.01 + (10/25)*0.02 + (3/25)*0.001 = 0.00892
p(topic2|document1): (3/25)*0.02 + (0/25)*0.005 = 0.0024
while document 2's use of topics:
p(topic1|document2): (5/30)*0.01 + (3/30)*0.02 + (8/30)*0.001 ≈ 0.00393
p(topic2|document2): (8/30)*0.02 + (4/30)*0.005 = 0.006
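The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration using only the toy numbers from the example. The names `topic_word_probs`, `doc_word_counts`, and `topic_usage` are hypothetical and are not part of the released files or of any particular library.

```python
# Toy data copied from the worked example above (names are hypothetical).
topic_word_probs = {
    "topic1": {"a": 0.01, "b": 0.02, "c": 0.001},
    "topic2": {"c": 0.02, "d": 0.005},
}
doc_word_counts = {
    "document1": {"a": 2, "b": 10, "c": 3, "d": 0, "e": 6, "f": 4},
    "document2": {"a": 5, "b": 3, "c": 8, "d": 4, "e": 0, "f": 10},
}


def topic_usage(word_counts, topic_probs):
    """Sum each topic's word weights, scaled by the word's relative frequency."""
    total = sum(word_counts.values())
    if total == 0:
        # An empty document uses no topic at all.
        return {topic: 0.0 for topic in topic_probs}
    return {
        topic: sum((word_counts.get(word, 0) / total) * prob
                   for word, prob in probs.items())
        for topic, probs in topic_probs.items()
    }


usage = {doc: topic_usage(counts, topic_word_probs)
         for doc, counts in doc_word_counts.items()}

for doc, scores in usage.items():
    for topic, score in scores.items():
        print(f"p({topic}|{doc}) = {score:.5f}")
```

Words such as e and f, which belong to neither topic, still count toward each document's total and so lower the relative frequency of the other words. With the full set of 2000 topics, a sparse matrix product would likely be more practical than nested dictionaries, but the computation is the same.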
The full publication is cited below:
@article{schwartz2013personality,
title={Personality, gender, and age in the language of social media: The open-vocabulary approach},
author={Schwartz, H Andrew and Eichstaedt, Johannes C and Kern, Margaret L and Dziurzynski, Lukasz and Ramones, Stephanie M and Agrawal, Megha and Shah, Achal and Kosinski, Michal and Stillwell, David and Seligman, Martin EP and others},
journal={PloS one},
volume={8},
number={9},
pages={e73791},
year={2013},
publisher={Public Library of Science}
}
Licensed under the GNU General Public License v3 (GPLv3).
Developed by the World Well-Being Project based out of the University of Pennsylvania and Stony Brook University.