R Fundamentals for Public Policy

Fall 2022

This course introduces and demonstrates how fun and exciting data analysis with R can be. Mastering computational tools and techniques not only enables social scientists to collect, wrangle, analyze, and interpret data with less pain and more fun, but also lets them work on research projects that previously seemed impossible. Because of time constraints, the course focuses on data wrangling, the most fundamental and time-consuming component of the data analysis workflow. After completing this course, students will know how to turn messy raw data into structured datasets using the R programming language. In addition, the course will help students develop a framework for learning computational methods and using data to make the government work for everyone and empower citizens to negotiate power.

Why R? R is free, easy to learn (thanks to the tidyverse), fast (thanks to Rcpp), runs everywhere, is open (16,000+ packages, counting only those available on CRAN), and has a large, growing, and inclusive community called #rstats.

I allow students to audit the course if they contact me by the first week of the course. I do not allow students to enroll in the course after the first week as this course tends to move fast.

Logistics

  • Time: Monday & Wednesday 10:20-12:00 (KST)
  • Lecture Room: TBA
  • Zoom link: Check out the EKDI site.

Teaching crew

Instructor: Dr. Jae Yeon Kim, Assistant Professor, KDI School of Public Policy and Management

Office Hours

Friday 14:00-16:00 (KST)

You can set up a 30-minute appointment with me via this Calendly link. Appointments are booked on a rolling basis.

  • In person: Professor Kim's Office (S320)
  • Via Zoom: The Zoom meeting link will be provided when it becomes available.

There are several uses for office hours. Some examples are listed below.

  1. You might wonder how computational methods apply to your research or work. In that case, I am happy to talk about computational social science and civic data science applications during office hours.
  2. You might find the course too challenging or too easy. In that case, I am eager to provide you with additional learning guidelines.
  3. You can also use this time simply to chat so that we can get to know each other.

Life hacks

Please read these slides (by Professor Shana Gadarian in the Department of Political Science in the Maxwell School of Citizenship and Public Affairs, Syracuse University) to understand how to communicate professionally with me, the TA, the CA, and your classmates.

Textbook

I use the canonical R textbook written by Grolemund and Wickham. Wickham is the mastermind behind the tidyverse, the most popular data analysis framework in the R ecosystem. Almost all of the course materials, including the textbook, are free and available online so that students can access them easily regardless of their backgrounds.

Readings

I use the GitHub course repository in place of readings. All course materials, including lecture notes, code demonstrations, sample data, and assignments, will be posted on this repository. The lecture notes will be provided at least 1-2 days in advance. I expect you to read them before coming to class.

Additional Resources

I am currently working on an open-access textbook titled “Computational Thinking for Social Scientists.” The book covers command-line tools, version control, data wrangling, visualization, functional programming, data product development, machine learning, and SQL. If you are interested in learning computational methods further, I recommend reading it. The course is a condensed version of the book's earlier parts.

Computer requirements

The software needed for the course is as follows:

  • Bash

  • Git

  • R and RStudio (latest versions)

  • Pandoc and LaTeX (for R markdown)

I have provided an installation guide on the GitHub repository. To avoid installation and configuration issues during class, I will make all the lecture notes available through MyBinder. Binder makes the code embedded in the lecture notes reproducible by anyone, anywhere. For the assignments and final exam, you should code in R on your own machine.

Evaluation

The course grade is based on the following:

  • Completion of assigned homework (60%)

  • Participation (10%)

  • Final exam (30%)

Grading rubric

Note that you get 40 points for free, but you must earn the rest.

  • A (4.0): 95~
  • A- (3.67): 90~94.9
  • B+ (3.33): 80~89.9
  • B (3.0): 70~79.9 (if you score 100 on every assignment, you are here before the final; if you score 50 on every assignment and get a perfect score on the final, you are also here.)
  • B- (2.67): 65~69.9
  • C+ (2.33): 60~64.9
  • C (2.0): 55~59.9
  • C- (1.67): 50~54.9
  • F (0): ~44.9
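
The two examples in the B row can be checked with a short calculation. The sketch below is only an illustration: `course_score` is a hypothetical helper, the weights come from the Evaluation section, each component is assumed to be scored on a 0-100 scale, and full participation credit is assumed because the examples appear to imply it.

```r
# Weights in percent, taken from the Evaluation section.
weights <- c(homework = 60, participation = 10, final = 30)

# Hypothetical helper: each component is scored on a 0-100 scale.
course_score <- function(homework, participation, final) {
  scores <- c(homework = homework, participation = participation, final = final)
  sum(weights * scores[names(weights)]) / 100
}

# Every assignment scored 100, full participation (assumed), final not yet taken.
before_final <- course_score(homework = 100, participation = 100, final = 0)

# Every assignment scored 50, full participation (assumed), perfect final.
after_final <- course_score(homework = 50, participation = 100, final = 100)

print(c(before_final, after_final))  # both are 70, the bottom of the B range
```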

Disciplinary measure

  • I don't tolerate cheating. Plagiarism is cheating. Group discussion is encouraged, but every assignment you submit MUST BE YOUR OWN. If your responses are exactly the same as other students', I will assume you are all involved in cheating. If you want to claim innocence, please bring me CLEAR and RELEVANT evidence.

  • If it happens once, I will take off EVERY POINT related to that section in the assignment or exam. In this case, I will not use the check, check-, check+ grading system for the assignment. Instead, I will use your score.

  • If it happens more than once, I will bring the matter to the disciplinary board. See this document for reference.

1. Assignment

Students will complete three assignments over 9 weeks of the course. The assignments provide frequent learning opportunities. Each assignment should be fairly short and is expected to take 4-5 hours of effort. You are encouraged to work in groups, but the work you turn in must be your own. Unless otherwise notified, each assignment should be rendered into an HTML output using R Markdown, and you should submit both the R Markdown file and the HTML output via the EKDIS course website (not the GitHub repository). In addition, the R Markdown files should be reproducible on our end, in the event we want to reproduce your work. I will cover what R Markdown is and how to create an HTML output in the second week of the course. The final exam uses the same output format. The assignments will be graded on a check-plus, check, check-minus scale.
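
Besides the Knit button in RStudio, an R Markdown file can be rendered to HTML from the R console. The following is a minimal sketch, assuming the rmarkdown package is installed; the file name `assignment1.Rmd` is hypothetical and stands in for your own file.

```r
# Hypothetical file name; replace it with your own assignment file.
rmd_file <- "assignment1.Rmd"

# By default, render() writes the HTML file next to the .Rmd file,
# using the same base name.
html_file <- sub("\\.Rmd$", ".html", rmd_file)

# Render only if the file exists and the rmarkdown package is available.
if (file.exists(rmd_file) && requireNamespace("rmarkdown", quietly = TRUE)) {
  rmarkdown::render(rmd_file, output_format = "html_document")
}
```

In this sketch, both `assignment1.Rmd` and the resulting `assignment1.html` would be the files to submit.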

2. Participation

The class participation portion of the grade can be satisfied in one or more of the following ways:

  • attending the lectures (the first class of the week, focusing more on concepts) and sections (the second class of the week, focusing more on hands-on practice)
  • asking and answering questions in class
  • contributing to class discussion through the Slack workspace: You should ask questions about class material and assignments through the Slack channels (rather than emailing me personally) so that everyone can benefit from the discussion. We encourage you to respond to each other’s questions as well. A CA will send you an invite to the workspace after the roster is confirmed.

3. Final exam

The final exam is a take-home exam that I expect you to complete on your own. On a day in the examination period, we will administer a take-home final examination covering material up to that point in the course. The exam requires applying the skills you’ve acquired throughout the course to a real-world data wrangling challenge. We will provide a link where you can retrieve the final exam and data. You should provide your student ID in the process and complete the exam within 24 hours. As with the assignments, you should turn in both the HTML output and the R Markdown file so that we can reproduce your analysis. This format is very similar to the technical interviews required for data science jobs, so it should be helpful for your career. I will take three things into consideration for the evaluation: reproducibility, efficiency, and readability.

Accessibility

This class is committed to creating an environment in which everyone can participate, regardless of background, discipline, or disability. If you have a particular concern, please come to me as soon as possible so that we can make special arrangements.

Course outline [Lecture notes]

1st Week

2nd Week

3rd Week

  • Getting the big picture of programming: objects and functions
  • Reading: lecture notes
  • Extra reading: Kim & Ng (2021)
  • Assignment: assignment #1 due

4th Week

  • Playing with data types: vectors, dataframes, and lists
  • Reading: lecture notes
  • Extra reading:

5th Week

  • Understanding the master framework: tidy principles
  • Reading: lecture notes
  • Extra reading:
  • Extra video: Tidy Data by Hadley Wickham

6th Week

  • Reshaping and cleaning data using tidyverse
  • Reading: lecture notes
  • Extra reading: Kim ch4

7th Week

8th Week

9th Week

  • Simplifying workflow using custom functions
  • Reading: lecture notes
  • Extra reading: Kim ch5
  • Extra video:
  • Assignment: assignment #3 due

10th Week

11th Week: Reading period

12th Week: Final exam (a 24-hour take-home exam)

Contact

  1. Course content related suggestions: create issues.

  2. Lecture/section related questions: use the Slack workspace.

  3. Logistics related personal requests: only in this case contact me via email.

In cases 1 and 2, other students may have similar issues, so I would like to solve these shared problems collectively.

Special thanks

This course is a remixed version of PS239T, originally developed by Rochelle Terman (currently Assistant Professor at the University of Chicago) and then revised by Rachel Bernhard (currently Assistant Professor at the University of California-Davis). I taught PS239T three times at Berkeley (as TA for Rachel, as lead instructor, and as co-instructor with Nick Kuipers). Other teaching materials draw from the workshops I created for D-Lab at UC Berkeley, where I was a senior data science fellow, instructor, and consultant. I also thank Dan Hopkins (Professor at the University of Pennsylvania) for sharing his political data analytics course syllabus.

This work is licensed under a Creative Commons Attribution 4.0 International License.