Table of Contents
- Potential Projects
- General Resources
- High Dimensional Data
Potential Projects
- Debiasing word embeddings (see "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings")
General Resources
- Jonathan Stray's Frontiers of Computational Journalism course
- Jun Yang's publications
- Information Retrieval Book
Reporting on society, using computation, and reporting on computation in society.
High Dimensional Data
Vectorizing data, then projecting it from the original high-dimensional space down to K dimensions, with K much smaller than the original dimensionality (typically K = 2 or K = 3)
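As a minimal sketch of this projection step: the notes don't name a specific method, so this example uses PCA via SVD (one common choice) on made-up data, projecting 50-dimensional points down to K = 2.

```python
import numpy as np

# Made-up data: 100 points in 50 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))

# PCA via SVD: center the data, then project onto the
# top-K right singular vectors (the principal components).
K = 2
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
projected = Xc @ Vt[:K].T  # shape (100, 2), ready to scatter-plot
```

Each row of `projected` is the 2-D coordinates of one original data point, suitable for plotting.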
Text analysis in Journalism
- Clustering, classification
- Document Vector Space Model: what is this document about?
- finding important words, topic analysis, key component for filtering
- Using words as features works fine: one dimension per word in the vocabulary, where the value of each dimension = # of times that word appears in the document
- Each entry becomes the term frequency tf(t, d)
- distance metric for text (how similar? clustering?)
- Looking for overlapping terms: cosine similarity. similarity(a, b) = (a \dot b) / (mag(a) \times mag(b)) = cos(theta). Cosine distance is just 1 - similarity(a, b).
- Also ignore stopwords and "de-weight" common words (e.g. "car" in car reviews)
- Document frequency df(t, D) = fraction of docs in the collection D containing term t
- Inverse document frequency idf(t, D) = log(|D| / |{d \in D : t \in d}|)
- TF-IDF: tfidf(t, d, D) = tf(t, d) \times idf(t, D), i.e. term frequency weighted by log inverse document frequency
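The pieces above (term frequency, IDF, cosine similarity) can be sketched end to end in plain Python. The three tiny documents are made up for illustration; note how a word appearing in every document (like "the") gets idf = log(1) = 0, which is exactly the de-weighting of common words described above.

```python
import math
from collections import Counter

# Made-up toy corpus D.
docs = [
    "the mayor proposed a new budget",
    "the budget cuts school funding",
    "voters rallied against the cuts",
]

# tf(t, d): raw count of term t in document d.
tfs = [Counter(doc.split()) for doc in docs]

def idf(term, tfs):
    # idf(t, D) = log(|D| / |{d in D : t in d}|)
    n_containing = sum(1 for tf in tfs if term in tf)
    return math.log(len(tfs) / n_containing)

def tfidf_vector(tf, tfs):
    # One dimension per word; value = tf(t, d) * idf(t, D).
    return {t: count * idf(t, tfs) for t, count in tf.items()}

def cosine_similarity(a, b):
    # similarity(a, b) = (a . b) / (|a| |b|), over sparse dict vectors.
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    mag_a = math.sqrt(sum(x * x for x in a.values()))
    mag_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (mag_a * mag_b) if mag_a and mag_b else 0.0

vecs = [tfidf_vector(tf, tfs) for tf in tfs]
```

With this weighting, docs 0 and 1 (which share the rarer term "budget") come out more similar than docs 0 and 2, whose only shared term "the" has zero IDF weight.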