FDR Pre-Doctoral Training Curriculum

Academic Year 2021-2022

This version 2021-11-11

Introduction

This 10-week training program is designed to prepare incoming pre-doctoral research fellows at the Princeton Empirical Studies of Conflict (ESOC) lab with the skills needed to support faculty research projects within ESOC, SPIA, and associated departments. The program draws from online courses for core data processing and visualization skills, research training materials from partnering organizations, and materials prepared by the team at ESOC.

The goal is to expose FDR fellows to all aspects of the data-driven research process. We touch upon topics such as best practices for data management, research methodologies used in the social sciences, and production-related skills like optimizing code for publication and working with LaTeX. Ultimately, these skills provide fellows with a strong analytical foundation for careers in public service and future PhD study in Political Science, Economics, and related fields.

The syllabus should be viewed as a resource for the long run. The goal is not to have all the answers in 10 weeks; it is to be confident that you can implement these methods and know where to go to find answers.

Acknowledgement

This curriculum was developed by Alicia Chen and Samikshya Siwakoti with guidance from Professor Jacob N. Shapiro and feedback from many colleagues. Gigs Banga provided invaluable beta testing, and Yining Sun and Nilima Pisharody completed post-testing revisions. The first cohort of FDR fellows, Chris Buckley and Hanjatiana Nirina Randrianarisoa, provided excellent feedback and made many changes based on their experience with the curriculum. Throughout, we draw on others’ outstanding resources and owe a particular debt to Scott Cunningham’s Causal Inference: The Mixtape. Developing this curriculum was an interdisciplinary team effort.

Bootcamp Details

All fellows should already have access to the “FDR Predoctoral Training” Dropbox (DB) folder that hosts all relevant course materials (including instructions for the three data exercises you’ll work on in the last weeks of the bootcamp). For non-fellows, we have made all the content publicly available through this GitHub repository.

Fellows should start by reading the ESOC Research Production Guide handout to learn about the lab’s research practices. Each fellow also has their own subfolder and should set their working directory to that designated folder. Doing so is a first step toward the project and data management best practices discussed in the handout, which we will expand upon in the last week.

Over the course of the bootcamp, we recommend keeping all files in an online storage service and importing all curriculum materials into that cloud storage. Each user should have their own subfolder and set their working directory to it.

Users from other institutions should seek their organization’s equivalent of the Research Production Guide to learn their specific standards and practices. Users who are not affiliated with any organization can look for resources on good research practices online, including those listed later in this document, or consult ESOC’s own Research Production Guide.

The majority of this program is designed to be completed asynchronously using online resources and data exercises we have created. However, we hope that users will interact and work together to better understand the course materials, when and if possible.

Prerequisites

Though the course is designed in part to develop your programming skills, basic familiarity with languages such as STATA, R, or Python is highly desirable. If you are more versed in coding with Stata, we recommend familiarizing yourself with and/or reviewing both Python and R resources before starting the course. If you are more versed in R or Python, we recommend going over the other language. We also recommend that you review basic functions and common packages used in your program of choice prior to the start of this bootcamp. Some useful resources to review are:

STATA

There are many outstanding sets of introductory modules for STATA online:

R

We highly recommend going through Datacamp’s Data Scientist track with R, at least finishing the Introduction and Visualization in R courses:

Python

We highly recommend going through Datacamp’s Data Scientist track with Python, at least finishing the introductory and visualization courses:

We will be focusing heavily on the Potential Outcomes framework for causal inference in this bootcamp. As such, we also recommend that prior to the start of bootcamp, you read the first chapter of Causal Inference: The Mixtape by Scott Cunningham and review basic math and statistics concepts provided in the resources below:

Finally, here are Princeton University resources you could explore if needed along with other additional resources you could review prior to the bootcamp:

University Resources:

Course Outline

This schedule is tentative and subject to change. The learning goals outlined are key concepts you should grasp and can start to implement in practice by the end of the week.

In some weeks, your assignment will be a week-long course that includes both readings and exercises from an online resource. In other weeks, we have created assignments that require you to submit something to us. For those weeks, make sure to complete the required readings before starting the exercise, as the exercises are designed to implement concepts covered in the readings. There are also additional resources listed for each week that are not required but are there for reference.

Note on the Capstone Project: We recommend that users begin thinking about potential research ideas and questions within the first few weeks to minimize complications, such as data unavailability, during the final weeks of the training. For example: brainstorm potential topics within your research interests in weeks 1 and 2, outline research questions for two or three topics in weeks 3 and 4, and check data availability to develop the research project in the following weeks.

Week 01: Data Cleaning, Descriptive Statistics, and Visualizations

Learning Objective:

  • Learn/review effective data cleaning and management techniques.
  • Understand fundamental concepts in statistics in order to describe data (e.g., distributions, central tendency theory, bi/multivariate data, etc.).
  • Implement descriptive analysis in R/STATA/Python (e.g., learn to extract key summary information from data, etc.).
  • Understand how and why different visualisation tools are used in descriptive data analysis.
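
As a minimal sketch of the descriptive step (not part of the assigned materials), the snippet below summarizes one variable using only Python's standard library. The `incidents` data and its interpretation are hypothetical and exist only for illustration.

```python
import statistics

# Hypothetical sample: monthly incident counts for one district (made-up numbers).
incidents = [3, 5, 4, 10, 4, 6, 3]

summary = {
    "n": len(incidents),
    "mean": statistics.mean(incidents),
    "median": statistics.median(incidents),
    "stdev": statistics.stdev(incidents),  # sample standard deviation (divides by n - 1)
    "min": min(incidents),
    "max": max(incidents),
}

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

Here the mean exceeds the median because of the single large value, which is the kind of skew a histogram or box plot would make visible.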

Assignment: Instructions are in the “Week 1 Assignment” folder on Dropbox.

Readings:

(a) Data cleaning:

(b) Descriptive analysis:

(c) Visualizing data

Additional Resources:

Week 02: Probability & Regressions

Learning Objective:

  • Understand the basics of linear regression and discrete choice models, and learn to run specific kinds of regression models in R/STATA/Python.
  • Understand what a DAG is and be able to construct one.
  • Be able to calculate and interpret coefficients and different standard errors in regression models.
  • Be prepared to run and plot interaction terms in linear and discrete choice models.
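
To make the interpretation of an interaction term concrete, here is a small sketch under simplifying assumptions: one continuous regressor `x`, one binary indicator `d`, and noise-free hypothetical data. In the fully interacted model y = b0 + b1*x + b2*d + b3*(x*d), the coefficients match what you get from fitting a separate line within each group, so the sketch recovers them from two closed-form simple regressions. The helper `ols_slope_intercept` is an illustrative name, not a library function.

```python
def ols_slope_intercept(x, y):
    """Closed-form simple OLS: slope = cov(x, y) / var(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


# Hypothetical noise-free data: y = 1 + 2x when d = 0, y = 3 + 5x when d = 1.
x = [0.0, 1.0, 2.0, 3.0]
y_d0 = [1 + 2 * xi for xi in x]
y_d1 = [3 + 5 * xi for xi in x]

slope_d0, intercept_d0 = ols_slope_intercept(x, y_d0)
slope_d1, intercept_d1 = ols_slope_intercept(x, y_d1)

# b1 is the slope when d = 0, b2 is the intercept shift for d = 1,
# and b3 (the interaction) is the difference in slopes between groups.
b1 = slope_d0
b2 = intercept_d1 - intercept_d0
b3 = slope_d1 - slope_d0
print(f"b1 = {b1}, b2 = {b2}, b3 = {b3}")
```

With real data you would estimate the pooled model with a regression routine and report standard errors; the point here is only that the interaction coefficient measures how the slope on `x` changes with `d`.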

Assignment:

Readings:

Additional Resources:

Week 03: Causal Inference

Learning Objective:

  • Understand the potential outcomes framework and develop familiarity with one way of approximating the ideal experiment.
  • Be able to identify and utilize methods for estimating causal effects using matching and subclassification.
  • Think about potential research designs to improve causal inference in your capstone project.
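
The following is a minimal sketch of subclassification on a single discrete covariate, assuming every stratum contains both treated and control units. The `records` data are made up for illustration and do not come from the assigned readings.

```python
from collections import defaultdict

# Hypothetical records: (stratum, treated, outcome). Numbers are made up.
records = [
    ("young", 1, 10), ("young", 1, 12), ("young", 0, 8), ("young", 0, 8),
    ("old", 1, 20), ("old", 0, 15), ("old", 0, 17), ("old", 0, 16),
]


def mean(values):
    return sum(values) / len(values)


# Naive comparison ignores that treatment is more common among the young.
naive = mean([y for _, d, y in records if d == 1]) - mean(
    [y for _, d, y in records if d == 0]
)

# Group outcomes by stratum and treatment status.
by_stratum = defaultdict(lambda: {0: [], 1: []})
for stratum, d, y in records:
    by_stratum[stratum][d].append(y)

# Weight each within-stratum difference in means by the stratum's sample share.
ate = sum(
    (len(g[0]) + len(g[1])) / len(records) * (mean(g[1]) - mean(g[0]))
    for g in by_stratum.values()
)
print(f"naive: {naive:.2f}, subclassified ATE: {ate:.2f}")
```

The gap between the naive and subclassified estimates comes entirely from the imbalance in treatment across strata, which is the problem subclassification is meant to address.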

Assignment: From Cunningham’s book: any data exercises from the assigned readings & the chapter-specific exercises from: https://mixtape.scunning.com/potential-outcomes.html

Readings: Chapters 4 and 5 of Scott Cunningham’s Causal Inference: The Mixtape, https://mixtape.scunning.com/potential-outcomes.html

Additional Resources:

Week 04: Operationalizing Regressions pt. 1

Learning Objective:

  • Be able to identify and utilize methods for estimating causal effects: regression discontinuity and instrumental variables.
  • Think about potential research designs to improve causal inference in your capstone project.
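
As one illustration of the instrumental variables logic, the sketch below computes the Wald estimator for a binary instrument: the reduced form (effect of the instrument on the outcome) divided by the first stage (effect of the instrument on treatment). The data are hypothetical and generated so that the true effect of `d` on `y` is 2; `mean_where` is an illustrative helper.

```python
# Hypothetical data with a binary instrument z, binary treatment d, outcome y.
# Outcomes are generated as y = 2 + 2*d, so the true effect of d is 2.
z = [0, 0, 0, 0, 1, 1, 1, 1]
d = [0, 0, 0, 1, 1, 1, 1, 0]
y = [2 + 2 * di for di in d]


def mean_where(values, flags, flag):
    """Mean of `values` over the observations where `flags` equals `flag`."""
    selected = [v for v, f in zip(values, flags) if f == flag]
    return sum(selected) / len(selected)


reduced_form = mean_where(y, z, 1) - mean_where(y, z, 0)  # effect of z on y
first_stage = mean_where(d, z, 1) - mean_where(d, z, 0)   # effect of z on d
wald = reduced_form / first_stage
print(f"reduced form: {reduced_form}, first stage: {first_stage}, Wald: {wald}")
```

A first stage close to zero makes this ratio unstable, which is one way to see why weak instruments are a concern.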

Assignment: From Cunningham’s book: any data exercises from the assigned readings & any other exercises from: https://mixtape.scunning.com/teaching-resources.html

Readings:

Week 05: Operationalizing Regressions pt. 2

Learning Objective:

  • Be able to understand Panel Data structure and why fixed effects are used in panel data
  • Be able to understand when and why a Difference-in-Differences model is used
  • Think about potential research designs to improve causal inference in your capstone project.
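
For intuition on the Difference-in-Differences estimator, here is a minimal two-group, two-period sketch. The `outcomes` values are made up, and the estimate is causal only under the parallel trends assumption.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical outcomes by group and period (made-up numbers).
outcomes = {
    ("treated", "pre"): [10, 12],
    ("treated", "post"): [18, 20],
    ("control", "pre"): [8, 10],
    ("control", "post"): [11, 13],
}

# Change over time within each group.
change_treated = mean(outcomes[("treated", "post")]) - mean(outcomes[("treated", "pre")])
change_control = mean(outcomes[("control", "post")]) - mean(outcomes[("control", "pre")])

# The control group's change stands in for what the treated group
# would have experienced without treatment.
did = change_treated - change_control
print(f"treated change: {change_treated}, control change: {change_control}, DiD: {did}")
```

In a regression with group and period indicators, this same quantity is the coefficient on their interaction.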

Assignment: From Cunningham’s book: any data exercises from the assigned readings & any other exercises from: https://mixtape.scunning.com/teaching-resources.html

Readings:

Week 06: Operationalizing Regressions pt. 3 & Panel Data Exercise #2

Learning Objective:

  • Be prepared to work with clustered standard errors.
  • Understand the basics of power analysis and identify what factors could affect statistical power.
  • Be able to interpret interaction terms in logit and probit models.
  • Practice panel data skills.
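
To show which factors drive statistical power, the sketch below computes the sample size per arm for a two-sided comparison of two means. It assumes a normal approximation, equal arm sizes, and a common standard deviation; the function name `n_per_arm` and its defaults are illustrative choices, not a reference implementation.

```python
import math
from statistics import NormalDist


def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Sample size per arm for a two-sided, two-sample comparison of means.

    Normal approximation with equal arms; effect_size is the standardized
    difference (difference in means divided by the common standard deviation).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


for effect in (0.2, 0.5, 0.8):
    print(f"effect size {effect}: {n_per_arm(effect)} per arm")
```

Required sample size grows with the inverse square of the effect size, so halving the detectable effect roughly quadruples the sample you need. A t-based calculation gives slightly larger numbers.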

Readings:

Assignment: Instructions for Panel data exercise 2 are in the “Panel Data Exercise” folder on Dropbox. This exercise will implement many of the tools and methods we covered in earlier weeks. You should feel free to review those notes as you complete this data challenge, in particular the assigned readings for the empirical methods and operationalizing regressions weeks. As you do the exercise, remember to think about substantive effects, not just statistical significance.

Week 07: Text-as-Data & Panel Data Exercise #1

Learning Objective:

  • Build basic skills for natural language processing
  • Learn to clean and process unstructured text data
  • Learn to create document term matrices
  • Explore dictionary and topic modeling methods for text analysis
  • Solidify understanding of Panel Data regressions
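
As a minimal sketch of what a document-term matrix is, the snippet below builds one by hand with the standard library. The two `documents` are invented, and the tokenizer is deliberately crude (lowercase alphabetic tokens only, no stop-word removal or stemming); NLP packages provide more complete versions of each step.

```python
import re
from collections import Counter

# Hypothetical mini-corpus (made-up sentences).
documents = [
    "The ceasefire held.",
    "Ceasefire talks resumed; talks continue.",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


# Term counts per document, then a shared, sorted vocabulary.
counts = [Counter(tokenize(doc)) for doc in documents]
vocabulary = sorted(set().union(*counts))

# One row per document, one column per vocabulary term.
dtm = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in dtm:
    print(row)
```

Each row is the bag-of-words representation of one document, which is the input that dictionary and topic modeling methods typically start from.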

Assignment:

  • Instructions are in the “Text-as-Data Exercise” folder on Dropbox
  • Instructions for Panel data exercise 1 are in the “Panel Data Exercise” folder on Dropbox.

We encourage you to use Python to handle Natural Language Processing. While there are many programs out there, Python has an impressive library of packages for working with text data and conducting the Natural Language Processing commonly used in the social sciences. You are welcome to find your own set of resources if you prefer using a different program (e.g., R, Stata, etc.) for this section.

Readings:

Additional NLP Resources:

Robustness checks Resources:

  • Abadie, A. (2005): “Semiparametric difference-in-differences estimators” The Review of Economic Studies, 72, 1–19.
  • Belloni, A., V. Chernozhukov, and C. Hansen (2014): “High-dimensional methods and inference on structural and treatment effects” Journal of Economic Perspectives, 28, 29–50.
  • Borusyak, K., X. Jaravel, and J. Spiess (2021): “Revisiting event study designs: Robust and efficient estimation” Tech. rep., Working Paper.
  • Callaway, B. and P. H. Sant’Anna (2020): “Difference-in-differences with multiple time periods” Journal of Econometrics.
  • Marcus, M. and P. H. Sant’Anna (2021): “The role of parallel trends in event study settings: An application to environmental economics” Journal of the Association of Environmental and Resource Economists, 8, 235–275.
  • Roth, J. (2021): “Pre-test with caution: Event-study estimates after testing for parallel trends” Working paper.
  • Sant’Anna, P. H. and J. Zhao (2020): “Doubly robust difference-in-differences estimators” Journal of Econometrics, 219, 101–122.
  • Wooldridge, J. (2021): “Two-Way Fixed Effects, the Two-Way Mundlak Regression, and Difference-in-Differences Estimators” Available at SSRN 3906345.
  • De Chaisemartin, C. and X. d’Haultfoeuille (2020): “Two-way fixed effects estimators with heterogeneous treatment effects” American Economic Review, 110, 2964–96.

Week 08: Working with GIS Data & Spatial Data Exercise

Learning Objective: Basic skills for visualizing geo-spatial data.

  • Understand the basics of working with GIS data.
  • Be able to visualize the geo-spatial data with layers.
  • Practice the basics of geo-coding with Google APIs.
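
One basic point when working with latitude and longitude is that degrees are not distances. As a rough sketch, the function below computes great-circle distance with the haversine formula, assuming a spherical Earth with a mean radius of 6371 km. The name `haversine_km` is illustrative; spatial packages handle projections and ellipsoids more carefully.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude spans less distance the farther you are from the equator.
print(round(haversine_km(0, 0, 0, 1), 1))
print(round(haversine_km(60, 0, 60, 1), 1))
```

This is why distance and area calculations are usually done after projecting data into an appropriate coordinate reference system.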

Assignment: Instructions are in the “GIS Data Exercise” folder on Dropbox.

We encourage you to use R to handle geospatial data. While there are many programs out there, R has an impressive library of packages for working with spatial data and conducting the spatial analysis commonly used in the social sciences. You are welcome to find your own set of resources if you prefer using a different program (e.g., Python, ArcGIS, Stata, etc.) for this section. While we recommend using the sf package to complete the assignment, the resources below use a mix of both sp and sf.

Readings:

(a) Introduction to GIS basics in R:

(b) GIS in R:

(c) Additional GIS resources:

  • Fick, S. E. and R. J. Hijmans (2017): “WorldClim 2: new 1-km spatial resolution climate surfaces for global land areas,” International Journal of Climatology, 37, 4302–4315. https://rmets.onlinelibrary.wiley.com/doi/full/10.1002/joc.
  • Goldblatt, R., M. F. Stuhlmacher, B. Tellman, N. Clinton, G. Hanson, M. Georgescu, C. Wang, F. Serrano-Candela, A. Khandelwal, W.-H. Cheng, and R. C. Balling Jr (2018): “Using Landsat and nighttime lights for supervised pixel-based image classification of urban land cover,” Remote Sensing of Environment, 205. https://gps.ucsd.edu/_files/faculty/hanson/hanson_publications_landsat.pdf
  • Hansen, M. C., P. V. Potapov, R. Moore, M. Hancher, S. A. Turubanova, A. Tyukavina, D. Thau, S. V. Stehman, S. J. Goetz, T. R. Loveland, et al. (2013): “High-resolution global maps of 21st-century forest cover change,” Science, 342, 850–853. https://pubmed.ncbi.nlm.nih.gov/24233722/
  • Stevens, F. R., A. E. Gaughan, C. Linard, and A. J. Tatem (2015): “Disaggregating census data for population mapping using random forests with remotely-sensed and ancillary data,” PloS one, 10, e0107042. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.

Week 09 & 10: Production

Learning Objective:

  • Draw on theories and methods canvassed throughout the course and see how all these skills come together.
  • Define your research question and hypotheses.
  • Explore and justify your research designs.
  • Incorporate reproducible research practices and implement the techniques.

Assignment: Finish capstone project! Questions that can be used as inspiration for capstone projects can be found in the GIS Data Exercise instructions.

Creating Efficient and Reproducible Code:

Working with LaTeX:

Overleaf is a commonly used collaborative writing tool in academia (think of it as Google Docs but with a LaTeX editor). Though you will likely use Overleaf for most of your work as an FDR fellow, you may also want to install a TeX distribution on your computer. The common ones are MacTeX for Mac and MiKTeX for Windows.

One of the most important skills to learn is to automatically convert your code results into nicely-formatted tables for publication. You should start by understanding how the “table” environment works: http://www1.maths.leeds.ac.uk/LaTeX/TableHelp1.pdf. Luckily, there are many packages available out there that can format your results into tables automatically. Here are some resources with code samples to get started:
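
To illustrate what such packages automate, here is a small hand-rolled sketch that turns a list of estimates into a LaTeX "table" environment. The function `to_latex_table`, the column layout, and the `results` values are all hypothetical; in practice a dedicated package handles stars, multiple models, and escaping of special characters such as underscores.

```python
def to_latex_table(rows, caption):
    """Format (term, estimate, std_error) rows as a LaTeX table environment."""
    lines = [
        r"\begin{table}[htbp]",
        r"\centering",
        rf"\caption{{{caption}}}",
        r"\begin{tabular}{lcc}",
        r"\hline",
        r"Term & Estimate & Std. Error \\",
        r"\hline",
    ]
    for term, estimate, std_error in rows:
        # Standard errors in parentheses, three decimal places throughout.
        lines.append(rf"{term} & {estimate:.3f} & ({std_error:.3f}) \\")
    lines += [r"\hline", r"\end{tabular}", r"\end{table}"]
    return "\n".join(lines)


# Hypothetical regression output (made-up numbers).
results = [("Treatment", 0.5, 0.1), ("Constant", 1.25, 0.2)]
print(to_latex_table(results, "Hypothetical regression results"))
```

Writing the string to a .tex file and pulling it into your paper with \input keeps tables in sync with the code that produced them.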

Additional Resources

Regressions:

Data visualization:

Implementing empirical models in code:

Diff-in-diff designs: https://edge.edx.org/assets/courseware/v1/b8d2a8030b7aa5f2762a464bf7f8b0c7/c4x/BerkeleyX/CEGA101AIE/asset/Module_2.5_Difference_in_Differences.pdf

Regression discontinuity: https://edge.edx.org/assets/courseware/v1/fe1cb61a45c21910d8981c75298484d2/c4x/BerkeleyX/CEGA101AIE/asset/Module_2.4_Regression_Discontinuity.pdf

This work is licensed under a Creative Commons Attribution 4.0 International License. [https://creativecommons.org/licenses/by/4.0/].