Shiro Kuriwaki (Stanford University)
Last presented: October 2021
(Sample problem set in the `pset` folder)
“Survey weighting is a mess.” – Gelman (2007)
Reweighting your data to make it representative of a population of interest is a ubiquitous operation in data science and survey research. The procedure has fundamental connections to classical statistical theory, and recently the connection between survey inference and causal inference has resurfaced: selection bias has become prominent, semi-opt-in online surveys are the norm, and traditional methods devised for researcher-designed sampling have become less relevant. One might refer to this change in the practice of surveying as the “modern” reality. Yet survey weighting is not taught in standard political science graduate training. Here I walk through these core concepts using the CCES as an example, drawing connections to concepts that are often covered in political science training, like causal inference and machine learning. I presume familiarity with a first-year PhD methods class covering the bias-variance tradeoff, OLS, and some research design.
Download this repo by creating an RStudio Project from it (File > New Project > Version Control > Git, with repository URL "https://github.com/kuriwaki/modern-weighting-2021").

Start a new R script (or Rmd if you prefer) at the root of the same repository.
Load libraries:

```r
library(tidyverse) # Easily Install and Load the 'Tidyverse'
library(scales)    # Scale Functions for Visualization
library(survey)    # Analysis of Complex Survey Samples
library(autumn)    # Fast, Modern, and Tidy-Friendly Iterative Raking
library(lme4)      # Linear Mixed-Effects Models using 'Eigen' and S4
```
Read in the data:

```r
poll  <- read_rds("data/poll.rds")
frame <- read_rds("data/pop_microdata.rds")
```
`poll` is a survey of 1,000 people. It is a biased sample of the population, which we call `frame` (the sampling frame) here. In many survey contexts we do not have a decent frame available, but we will use one here.

- `Y` is a binary outcome of interest (e.g. a Yes to some public opinion question). It is a synthetic variable I made up; the content is not important here. It is not observed in the sampling frame: the point of a poll is that you observe the data of interest only in your sample, yet you care about generalizing to the population.
- `S` is an indicator for Selection / Survey Inclusion. It is recorded in the frame: 1 if the record in the frame ended up in the survey, 0 otherwise.
- `ID` is a common numeric ID for merging.
- All other variables follow the CCES / Cumulative CCES.
How to read this document: the preview on GitHub is a compiled version of README.Rmd. The README.Rmd has the code to produce all the results, but the preview shows only the results. If you need to check the answers, look at the Rmd. Alternatively, you can extract the solutions as an R script with `knitr::purl("README.Rmd")`.
“The purpose of poststratification is to correct for known differences between sample and population.” — Gelman (2007)
Q: What is the distribution of education in the sample? In the population?
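One way to tabulate both distributions (a sketch using the objects loaded above; `percent()` is from scales):

```r
# education distribution in the sample ...
poll %>%
  count(educ) %>%
  mutate(frac = percent(n / sum(n)))

# ... and in the population frame
frame %>%
  count(educ) %>%
  mutate(frac = percent(n / sum(n)))
```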
#> # A tibble: 4 × 3
#> educ n frac
#> <dbl+lbl> <int> <chr>
#> 1 1 [HS or Less] 166 17%
#> 2 2 [Some College] 309 31%
#> 3 3 [4-Year] 313 31%
#> 4 4 [Post-Grad] 212 21%
#> # A tibble: 4 × 3
#> educ n frac
#> <dbl+lbl> <int> <chr>
#> 1 1 [HS or Less] 2717 27%
#> 2 2 [Some College] 3548 35%
#> 3 3 [4-Year] 2323 23%
#> 4 4 [Post-Grad] 1412 14%
Q 1.1: What are the weights that correct for this imbalance?
#> # A tibble: 4 × 3
#> educ n wt_frac
#> <dbl+lbl> <dbl> <dbl>
#> 1 1 [HS or Less] 272. 0.272
#> 2 2 [Some College] 355. 0.355
#> 3 3 [4-Year] 232. 0.232
#> 4 4 [Post-Grad] 141. 0.141
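A sketch of one way to construct these weights: divide each education category's population share by its sample share, so the weighted sample reproduces the population margin. The column names (`wt_ps`, `n_cell`, etc.) are my own.

```r
# population share of each education category
educ_tgt <- frame %>%
  count(educ) %>%
  mutate(pop_frac = n / sum(n)) %>%
  select(educ, pop_frac)

# attach weights: w = population share / sample share
poll_wt <- poll %>%
  add_count(educ, name = "n_cell") %>%
  left_join(educ_tgt, by = "educ") %>%
  mutate(wt_ps = pop_frac / (n_cell / n()))

# check: the weighted education distribution now matches the population
poll_wt %>%
  count(educ, wt = wt_ps) %>%
  mutate(wt_frac = n / sum(n))
```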
Q 1.2: What is the weighted proportion of `Y`? Is it closer to the population proportion of `Y` than the unweighted proportion?
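A sketch, recomputing the education weights from Q 1.1 inline so the snippet is self-contained:

```r
educ_tgt <- frame %>%
  count(educ) %>%
  mutate(pop_frac = n / sum(n)) %>%
  select(educ, pop_frac)

poll_w <- poll %>%
  add_count(educ, name = "n_cell") %>%
  left_join(educ_tgt, by = "educ") %>%
  mutate(wt = pop_frac / (n_cell / n()))

mean(poll_w$Y)                      # unweighted proportion
weighted.mean(poll_w$Y, poll_w$wt)  # weighted proportion
```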
Q 1.3: Extension: Explain how you would do the same reweighting but for education and race.
- What is the distribution of race and education in the population?
#> as_factor(race)
#> as_factor(educ) White Black Hispanic Asian All Other Sum
#> HS or Less 0.21 0.03 0.02 0.00 0.01 0.27
#> Some College 0.25 0.05 0.03 0.01 0.02 0.36
#> 4-Year 0.16 0.03 0.02 0.01 0.01 0.23
#> Post-Grad 0.11 0.01 0.01 0.01 0.01 0.15
#> Sum 0.73 0.12 0.08 0.03 0.05 1.01
- In the survey?
#> as_factor(race)
#> as_factor(educ) White Black Hispanic Asian All Other Sum
#> HS or Less 0.14 0.01 0.00 0.00 0.00 0.15
#> Some College 0.24 0.03 0.02 0.00 0.01 0.30
#> 4-Year 0.22 0.05 0.03 0.01 0.01 0.32
#> Post-Grad 0.17 0.01 0.01 0.01 0.01 0.21
#> Sum 0.77 0.10 0.06 0.02 0.03 0.98
Q 1.4: Estimate those weights. How does it differ from the previous weights?
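The same logic extends to the joint race-by-education cells; a sketch, with hypothetical column names:

```r
# population share of each race x education cell
joint_tgt <- frame %>%
  count(race, educ) %>%
  mutate(pop_frac = n / sum(n)) %>%
  select(race, educ, pop_frac)

# weight = population cell share / sample cell share
poll_joint <- poll %>%
  add_count(race, educ, name = "n_cell") %>%
  left_join(joint_tgt, by = c("race", "educ")) %>%
  mutate(wt_joint = pop_frac / (n_cell / n()))
```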
Q 1.5: What are some roadblocks / issues with doing poststratification everywhere?
Q 2.1: Now suppose you did NOT know the population joint distribution of race x education, but you knew the marginals. Suppose that
- White (1): 72%
- Black (2): 12%
- Hispanic (3) : 10%
- Asian (4): 3%
- Other (5): 3%
and
- HS or Less (1): 40%
- Some College (2): 35%
- 4-Year (3): 15%
- Post-grad (4): 10%
Using raking, create weights that satisfy these marginal distributions.
#> Joining, by = c("ID", "weight", "weight_cumulative")
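One way to rake to these marginals is `autumn::harvest()`. A sketch, where I assume the targets are passed as a named list of named proportion vectors keyed by the underlying numeric codes; check `?harvest` for the exact format:

```r
rake_tgt <- list(
  race = c(`1` = 0.72, `2` = 0.12, `3` = 0.10, `4` = 0.03, `5` = 0.03),
  educ = c(`1` = 0.40, `2` = 0.35, `3` = 0.15, `4` = 0.10)
)

# harvest() iteratively rakes and returns the data with a weight column appended
poll_raked <- harvest(poll, rake_tgt)
```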
Q 2.2: Intuitively, what is the assumption we need to make for raking to give the same answer as poststratification?
“It is not always clear how to use weights in estimating anything more complicated than a simple mean or ratios, and standard errors are tricky even with simple weighted means.” — Gelman (2007)
Q 3.1 : What is a standard error of a survey estimator? How is it different from the standard deviation / variance of your estimates?
Q 3.2 : What do you need to compute the MSE (or RMSE) of an estimate? What are the components?
Q 3.3: Does weighting tend to increase or decrease the standard error of the estimator? The effective sample size? The design effect? Why?
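The usual intuition can be made concrete with Kish's approximations: more variable weights mean a larger design effect and a smaller effective sample size. A sketch (`design_diagnostics` is a hypothetical helper, not from any package):

```r
design_diagnostics <- function(w) {
  deff  <- 1 + (sd(w) / mean(w))^2  # Kish approximate design effect, 1 + CV(w)^2
  n_eff <- sum(w)^2 / sum(w^2)      # Kish effective sample size
  c(deff = deff, n_eff = n_eff)
}

design_diagnostics(rep(1, 100))       # equal weights: deff = 1, n_eff = 100
design_diagnostics(runif(100, 0, 2))  # variable weights: deff > 1, n_eff < 100
```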
“Regression modeling is a potentially attractive alternative to weighting. In practice, however, the potential for large numbers of interactions can make regression adjustments highly variable. This paper reviews the motivation for hierarchical regression, combined with poststratification, as a strategy for correcting for differences between sample and population.” — Gelman (2007)
Let’s treat the values as factors now:

```r
poll_fct <- poll %>%
  mutate(educ = as_factor(educ), race = as_factor(race))

frame_fct <- frame %>%
  mutate(educ = as_factor(educ), race = as_factor(race))

tgt_fct <- frame_fct %>%
  count(educ, race)
```
Q 4.1: Using a logit, what are the predicted values of the outcome in each of the poststratification cells?
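A sketch of one version, using a fully interacted logit so each cell gets its own prediction (the interaction is my choice; a simpler additive model is also possible):

```r
# logit of Y on the education x race interaction, fit on the sample
fit_logit <- glm(Y ~ educ * race, family = binomial, data = poll_fct)

# predicted probability of Y = 1 in each population cell
cell_pred <- tgt_fct %>%
  mutate(Ypred = predict(fit_logit, newdata = ., type = "response"))
```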
#> # A tibble: 20 × 4
#> educ race n Ypred
#> <fct> <fct> <int> <dbl>
#> 1 HS or Less White 2137 0.451
#> 2 HS or Less Black 275 0.500
#> 3 HS or Less Hispanic 182 0.400
#> 4 HS or Less Asian 36 0.00000349
#> 5 HS or Less All Other 87 0.250
#> 6 Some College White 2468 0.375
#> 7 Some College Black 538 0.394
#> 8 Some College Hispanic 296 0.526
#> 9 Some College Asian 74 0.333
#> 10 Some College All Other 172 0.286
#> 11 4-Year White 1611 0.391
#> 12 4-Year Black 280 0.438
#> 13 4-Year Hispanic 216 0.621
#> 14 4-Year Asian 122 0.538
#> 15 4-Year All Other 94 0.250
#> 16 Post-Grad White 1071 0.422
#> 17 Post-Grad Black 106 0.533
#> 18 Post-Grad Hispanic 64 0.300
#> 19 Post-Grad Asian 114 0.250
#> 20 Post-Grad All Other 57 0.500
Q 4.2: What is the “MRP” estimate for Y in the population then?
#> # A tibble: 1 × 1
#> Ypred
#> <dbl>
#> 1 0.412
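That estimate is the cell predictions averaged with population cell counts as weights, i.e. the poststratification step of MRP. A self-contained sketch:

```r
# fit on the sample, predict in each population cell, poststratify
fit_logit <- glm(Y ~ educ * race, family = binomial, data = poll_fct)

frame_fct %>%
  count(educ, race) %>%
  mutate(Ypred = predict(fit_logit, newdata = ., type = "response")) %>%
  summarize(Ypred = weighted.mean(Ypred, w = n))
```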
Q 4.3: What are the issues with a simple logit?
“Contrary to what is assumed by many theoretical statisticians, survey weights are not in general equal to inverse probabilities of selection but rather are typically constructed based on a combination of probability calculations and nonresponse adjustments.” — Gelman (2007)
Q 5.1: Using the population data and matching on ID, estimate a propensity score (probability of selection).
`S` is an indicator in the frame marking the selection / survey cases: 1 means the observation is also in `poll`, 0 otherwise.
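A sketch of one way to fit this score, with a logit of `S` on frame covariates (the covariate set here is illustrative):

```r
# propensity of selection into the poll, estimated on the full frame
fit_sel <- glm(S ~ race + educ + state, family = binomial, data = frame_fct)

frame_scored <- frame_fct %>%
  mutate(Spred = predict(fit_sel, type = "response"))

# inverse-probability-of-selection weights for the sampled cases
poll_ipw <- frame_scored %>%
  filter(S == 1) %>%
  mutate(wt_ips = 1 / Spred)
```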
#> # A tibble: 10 × 4
#> ID race educ Spred
#> <chr> <fct> <fct> <dbl>
#> 1 307933 Hispanic 4-Year 0.0956
#> 2 323528 White Post-Grad 0.176
#> 3 301197 White Some College 0.0985
#> 4 291737 White Some College 0.135
#> 5 270162 White 4-Year 0.147
#> 6 276240 Black Some College 0.0851
#> 7 288696 Hispanic HS or Less 0.0499
#> 8 297716 Black Post-Grad 0.121
#> 9 301889 All Other Some College 0.0546
#> 10 283893 White Some College 0.0763
Q 5.2: What are the issues with propensity scores?
Note: Links to Causal Inference and the Weighting vs. Matching Distinction
- Coarsened Exact Matching
- Balance Test Fallacy
- See Gary King’s talk, “Why Propensity Scores Should Not Be Used for Matching”. Propensity scores can be used for things other than matching (like weighting), but some of the same problems crop up.
Q 5.3: Compute covariate balancing propensity score (CBPS) weights from the small frame, and do the same with entropy balancing (ebal) weights.
Load packages and start with a small sample.

```r
library(ebal)
library(CBPS)
select <- dplyr::select # package conflict

# matrix version of what we are about to fit
frame_X <- model.matrix(~ educ + race + state, frame_fct)[, -1]

fit_cbps <- CBPS(S ~ race + educ + state, data = frame_fct, ATT = 0, iterations = 100)
fit_ebal <- ebalance(Treatment = frame_fct$S, X = frame_X)
```
#> Converged within tolerance
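To use these fits, pull out the weights. A sketch, assuming the slot names as I recall them from the CBPS and ebal documentation (`$weights` on a CBPS fit, `$w` on an ebalance fit):

```r
wt_cbps <- fit_cbps$weights  # one weight per row of frame_fct
wt_ebal <- fit_ebal$w        # one weight per control (S == 0) unit
```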
- Survey inference is causal inference where the treatment is selection
- Total Error is Bias^2 + Variance
- Many things work “in theory” (asymptotically) but cause variance problems in practice: post-stratification, inverse propensity score weighting
- Shrinkage and regularization (random effects, ML) reduce variance at the cost of a small amount of bias
- Balancing scores guarantee balance on some marginals, while minimizing distance on others
- Use `tidycensus` to get summary stats from the ACS and the decennial Census.
- For the CCES, I have designed a package, `ccesMRPprep`, with commonly used functions.
- Use the `dataverse` package to download Dataverse datasets by code.
- Devin Caughey, Adam Berinsky, Sara Chatfield, Erin Hartman, Eric Schickler, and Jas Sekhon. 2020. “Target Estimation and Adjustment Weighting for Survey Nonresponse and Sampling Bias”. Elements in Quantitative and Computational Methods for the Social Sciences.
  - A good overview, up to date with modern survey sampling, with enough math to clarify what goes on under the hood.
- Kosuke Imai’s lecture slides on weighting (PhD-level causal inference).
  - A very condensed slide-level overview: IPW, Hajek, CBPS, ebal.
- Andrew Gelman. 2007. “Struggles with Survey Weighting and Regression Modeling”. Statistical Science.
  - The intro is worth reading for general understanding. The main text then turns into a proposal for what is essentially MRP.
- Douglas Rivers. 2007. “Sampling for Web Surveys”.
  - One of the first papers to outline the new realities of web, opt-in surveys.
- Paul Rosenbaum and Donald Rubin. 1983. “The central role of the propensity score in observational studies for causal effects”. Biometrika.
  - A canonical reference for the idea of the propensity score.
- Kosuke Imai, Gary King, and Elizabeth Stuart. 2008. “Misunderstandings between experimentalists and observationalists about causal inference”. JRSS A.
  - Also see: Gary King. 2007. “Why Propensity Scores Should Not Be Used for Matching”, Methods Colloquium talk (article with Rich Nielsen).
- Kosuke Imai and Marc Ratkovic. 2014. “Covariate balancing propensity score”. JRSS B.
  - Introduces `CBPS`.
- Jens Hainmueller. 2012. “Entropy Balancing for Causal Effects: A Multivariate Reweighting Method to Produce Balanced Samples in Observational Studies”. Political Analysis.
  - Introduces `ebal`.