Natural Language Processing for Political Science

Workshop on text analytics for political science

Slava Mikhaylov, Institute for Analytics and Data Science, Department of Computer Science and Electronic Engineering, Department of Government, University of Essex

Overview

Data science combines scientific inquiry, statistical knowledge, substantive expertise, and computer programming. One of the main challenges for businesses and policy makers when using big data is finding people with the appropriate skills. Good data science requires experts who combine substantive knowledge with data-analytic skills, which makes it a prime area for social scientists with an interest in quantitative methods. This course focuses on one aspect of data science: generating valuable insights from text using natural language processing.

Preparing for the course

Before the course you should:

Download and install R and RStudio on your computer.
On Windows, install Rtools from CRAN (https://cran.r-project.org/bin/windows/Rtools/). On OS X, install Xcode, available for free from the App Store. Either one provides a compiler (if you don't already have one), which is needed to install packages from GitHub that require compilation from C++ source code.
Make sure you have at least R 3.3.3 installed. (The latest version of R, as of 23 May 2017, is 3.4.0.)
Make sure your packages are up-to-date. From the command line, run update.packages(ask = FALSE)
Install quanteda from CRAN. From the Packages pane in RStudio, or from the command line: install.packages("quanteda")
Install readtext from GitHub, following these instructions
Try creating and "knitting" an RMarkdown file. You can run the attached test.Rmd file; if it builds without error and looks like this test.html, then you have successfully configured your system. If asked by RStudio, install all needed packages.
Set up a GitHub account and install the necessary packages, following the discussion in the Git and GitHub chapter of Hadley Wickham's "R Packages".
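
The version and package checks in the list above can be sketched in a few lines of base R. This is only a rough self-check, not part of the official course materials: the names `min_r`, `needed`, and `have` are illustrative, and including `rmarkdown` is an assumption based on the knitting step.

```r
# Minimum R version named in the preparation list
min_r <- "3.3.3"
r_ok <- getRversion() >= min_r

# Packages mentioned in the list; rmarkdown is assumed for the knitting step
needed <- c("quanteda", "readtext", "rmarkdown")

# TRUE for each package whose namespace can be loaded, FALSE otherwise
have <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

if (!r_ok) {
  message("R ", getRversion(), " is older than ", min_r, "; please update R.")
}
if (any(!have)) {
  message("Missing packages: ", paste(needed[!have], collapse = ", "))
}
```

If neither message appears, the R version and packages are in place; the RMarkdown knitting test is still worth running separately.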

Instructions for use of course materials

You have three options for downloading the course material found on this page:

You can download the materials by clicking on each link.
You can "clone" the repository using the button labelled "Clone in Desktop", found on the right side of your browser window as you view this repository. If you do not have a git client installed on your system, you will need to get one here and also make sure that git is installed. Cloning is the preferred option, since you can refresh your clone as new content gets pushed to the course repository. (New material will be pushed to the course repository at least once per day while the course takes place.)
Statically, you can choose the button on the right marked "Download zip" which will download the entire repository as a zip file.

You can also subscribe to the repository if you have a GitHub account, which will send you updates each time new changes are pushed to the repository.

Schedule

1: Overview and introduction to data science

Notes

2: Replicability in social science

R Markdown, R Notebooks, GitHub.

Examples:

3: Text scaling models

Naive Bayes, PCA, CA.

Background

Laver, Michael, Kenneth Benoit and John Garry. 2003. "Extracting Policy Positions from Political Texts Using Words as Data." American Political Science Review 97: 311-331.
Lowe, William. 2008. "Understanding Wordscores." Political Analysis 16(4): 356-371.
Greenacre, M. (2007). Correspondence Analysis in Practice, 2nd edition. Appendix A & B.
Spirling, A. (2012), "U.S. Treaty Making with American Indians: Institutional Change and Relative Power, 1784-1911." American Journal of Political Science, 56: 84–97.
Herzog, A. and K. Benoit (2015), "The most unkindest cuts: Speaker selection and expressed government dissent during economic crisis." Journal of Politics, 77(4):1157–1175.
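
As a rough illustration of the scaling idea in Laver, Benoit and Garry (2003), the raw Wordscores computation can be sketched in base R. The counts and reference positions below are made up, all object names are hypothetical, and the sketch omits the rescaling of virgin text scores discussed in the readings; the session itself may use different tools.

```r
# Toy word counts (rows = documents, columns = words); values are invented
counts <- rbind(
  ref_left  = c(tax = 1, welfare = 8, market = 1),
  ref_right = c(tax = 6, welfare = 1, market = 8),
  virgin    = c(tax = 3, welfare = 3, market = 4)
)

# Assumed known positions of the two reference texts
ref_scores <- c(ref_left = -1, ref_right = 1)

refs <- counts[names(ref_scores), , drop = FALSE]

# Relative frequency of each word within each reference text
f_wr <- refs / rowSums(refs)

# P(reference | word): normalise each word's column to sum to 1
p_rw <- sweep(f_wr, 2, colSums(f_wr), "/")

# Word score: expected reference position given that the word is observed
word_scores <- colSums(p_rw * ref_scores)

# Raw virgin text score: frequency-weighted mean of the word scores
f_wv <- counts["virgin", ] / sum(counts["virgin", ])
virgin_score <- sum(f_wv * word_scores)

word_scores
virgin_score
```

Words used mostly in the left reference text receive negative scores and words used mostly in the right reference text receive positive ones, so the virgin text lands between the two reference positions.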

4: Topic models

LDA, CTM, STM.

Background

David Blei (2012). "Probabilistic topic models." Communications of the ACM, 55(4): 77-84.
Blei, David, Andrew Y. Ng, and Michael I. Jordan (2003). "Latent dirichlet allocation." Journal of Machine Learning Research 3: 993-1022.
Blei, David (2014) "Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models." Annual Review of Statistics and Its Application, 1: 203-232.
Roberts, Stewart, Tingley, Lucas, Leder-Luis, Gadarian, Albertson, and Rand (2014). "Structural topic models for open-ended survey responses." American Journal of Political Science, 58(4): 1064-1082.
Blei, D. and J. Lafferty "Topic Models." In Text Mining: Classification, clustering, and applications, A. Srivastava and M. Sahami (eds.), pp 71-94, 2009. Chapter available here.

5: Word embeddings

word2vec, text2vec.

Notes

Background

Mikolov, Tomas et al. "Efficient Estimation of Word Representations in Vector Space."
Goldberg, Yoav and Omer Levy "word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method."
Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems.
Levy, Omer; Goldberg, Yoav; Dagan, Ido (2015). "Improving Distributional Similarity with Lessons Learned from Word Embeddings." Transactions of the Association for Computational Linguistics.
Pennington et al. "GloVe: Global Vectors for Word Representation."
Huang et al. "Improving Word Representations via Global Context and Multiple Word Prototypes."

6: Textual patterns and quality assessment

keyness, similarity, distance, readability
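
To make the similarity and distance measures concrete, here is a minimal base R sketch on a toy document-term matrix. The counts are invented and the `cosine` helper is a hypothetical name defined here for illustration; it is not a function from any package used in the course.

```r
# Toy document-term matrix (rows = documents, columns = words); values invented
dtm <- rbind(
  doc_a = c(economy = 2, health = 0, tax = 1),
  doc_b = c(economy = 4, health = 0, tax = 2),
  doc_c = c(economy = 0, health = 3, tax = 0)
)

# Cosine similarity: dot product divided by the product of vector lengths
cosine <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

sim_ab <- cosine(dtm["doc_a", ], dtm["doc_b", ])  # same word proportions
sim_ac <- cosine(dtm["doc_a", ], dtm["doc_c", ])  # no words in common

# Euclidean distances between all pairs of documents
euclid <- dist(dtm, method = "euclidean")

sim_ab
sim_ac
euclid
```

Cosine similarity ignores document length, so `doc_a` and `doc_b` look almost identical even though one is twice as long, while Euclidean distance still separates them.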

7: Project discussion

Prepare a presentation (using R Markdown) on a project that uses NLP. (Optional, group work.)
Presentation and discussion of each project.
