Rutgers Statistics Workshop: Mixed Effects Models for Linguists

Materials created by Judith Degen, drawing on materials by Florian Jaeger, Maureen Gillespie, Peter Graff, Dave Kleinschmidt, Roger Levy, and Victor Kuperman

Lecturer

Judith Degen -- jdegen@stanford.edu

Description

The workshop is geared towards linguists interested in analyzing data from various types of experiments (truth-value judgments, Likert scale judgments, response times, reading times, etc.). We'll be focusing on regression methods, in particular mixed effects models, which have proven to be a powerful tool for analyzing linguistic data (see also Harald Baayen's book "Analyzing Linguistic Data", linked below). To make the most of the short time available, the workshop contains a large hands-on component in which participants will have the opportunity to analyze existing datasets as well as datasets they bring themselves.

Preparation

We will be using R in this course. To get the most out of it, please bring your laptop and come with R and RStudio installed.

If you have never used R before, I recommend working through chapters 1, 2, 4, and 5 of the Introduction to R on https://www.datacamp.com/home -- it sounds like a lot, but each "chapter" is actually just a few short exercises, and it'll get you used to writing basic R code.

Schedule

Apart from the very first session (and food sessions...) the workshop will consist of lectures interwoven with practical exercises, so everyone can get their hands dirty with data after each new concept is introduced. On the first day we'll be focusing on the general concept of regression and its simplest instantiation, (mixed effects) linear regression for continuous data (e.g., response times, reading times, slider ratings). On the second day we'll turn to logistic regression for binary data (e.g., truth-value judgment data or any other binary choice) and ordinal regression for ordinal data (e.g., Likert scale ratings). I also want to spend a significant amount of time on data visualization with ggplot.
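As a rough sketch of what the two main model types look like in R: the example below assumes the lme4 package is installed, and it uses the sleepstudy and cbpp example datasets that ship with lme4, not the workshop datasets. The model names m_linear and m_logistic are just illustrative.

```r
# Sketch only: assumes lme4 is installed. sleepstudy and cbpp are example
# datasets bundled with lme4; they are NOT the workshop data.
library(lme4)

# Day 1 style model: mixed effects linear regression for a continuous outcome.
# Reaction time is predicted by days of sleep deprivation, with by-subject
# random intercepts and by-subject random slopes for Days.
m_linear <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)
print(fixef(m_linear))

# Day 2 style model: mixed effects logistic regression for binary outcomes.
# Each row of cbpp aggregates binary outcomes (incidence out of size),
# with by-herd random intercepts.
m_logistic <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
                    data = cbpp, family = binomial)
print(fixef(m_logistic))
```

The only change between the two calls is the function (lmer vs. glmer) and the family argument; the random effects go in parentheses in both cases.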

I'll be adding code sheets here for participants to follow along with as I finalize them.

When	What	Where	Slides / Readings / Resources
Fri 10 - 10:30	Workshop overview	Room 108	slides
Fri 11 - 12	R basics and linear regression	Room 108	slides / code / solutions
Fri 1 - 3	Mixed effects linear regression	Language Lab next door to Linguistics Department	slides / code
Fri 3 - 5:30	Individual meetings / bring your own dataset!	Language Lab or Department Basement
Sat 10 - 11	Data wrangling in R	Room 108	code
Sat 11 - 12	Mixed effects logistic regression	Room 108	slides / code
Sat 1 - 2	Common issues in MEMs & solutions	Room 108	slides / code
Sat 2 - 3	Visualizing your data: mastering ggplot	Room 108	slides / code

Resources

Books

The Bible of mixed effects models: Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge University Press.

Blogs / slides / lecture notes / videos / email lists

Florian Jaeger's excellent collection of resources for regression methods (code sheets, slides, pointers to further resources) on his HLPLab wiki. This includes Maureen Gillespie's tutorial on how to code your predictors to test different kinds of hypotheses.
Shravan Vasishth's excellent statistics lecture notes on his statistics GitHub site.
Andrew Ng's excellent Coursera course on Machine Learning for great video explanations of linear and logistic regression. You can also just watch the YouTube videos directly, e.g. this one which explains the very basics of linear regression.
In class we didn't get to ordinal or multinomial regression. Here is Rune Haubo Christensen's tutorial on ordinal regression (for ordinal data like Likert Scale ratings). Here is a tutorial on multinomial regression (for unordered categorical data with more than 2 levels, like the choice between referring to a referent by a name, a pronoun, or a definite description).
Subscribe to the ling-R-lang list -- language researchers with R(egression) problems and solutions.
For help in R: try ?foo, where foo stands for the name of a function.
Data wrangling with R cheatsheet

Papers

The debate on how to choose a random effects structure:

Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255-278.

Bates, D., Kliegl, R., Vasishth, S., & Baayen, H. (2015). Parsimonious mixed models. arXiv preprint arXiv:1506.04967.
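To make the question concrete, here is a minimal sketch of two candidate random effects structures for the same fixed effect, compared with a likelihood ratio test. It assumes the lme4 package and its bundled sleepstudy dataset (not workshop data), and the names m_max and m_int are illustrative. It is not the procedure recommended by either paper, just an illustration of what "more" and "less" random effects structure look like in lme4 syntax.

```r
# Sketch only: assumes lme4 is installed; sleepstudy ships with lme4.
library(lme4)

# Richer structure: by-subject random intercepts AND random slopes for Days.
# Fit with maximum likelihood (REML = FALSE) so the two fits can be compared.
m_max <- lmer(Reaction ~ Days + (1 + Days | Subject),
              data = sleepstudy, REML = FALSE)

# Reduced structure: by-subject random intercepts only.
m_int <- lmer(Reaction ~ Days + (1 | Subject),
              data = sleepstudy, REML = FALSE)

# Likelihood ratio test: does adding the random slope improve the fit?
cmp <- anova(m_int, m_max)
print(cmp)
```

Whether to start from the richer model and simplify, or to build up from the simpler one, is exactly what the two papers above disagree about.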

Why mixed effects logistic regression is preferable to ANOVAs over proportions:

Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59(4), 434-446.

thegricean/mem_tutorial