Fall 2022
This course introduces data analysis with R and demonstrates how fun and exciting it can be. Mastering computational tools and techniques not only enables social scientists to collect, wrangle, analyze, and interpret data with less pain and more fun, but also lets them work on research projects that would previously have seemed impossible. Because of time constraints, the course focuses on data wrangling, the most fundamental and time-consuming component of the data analysis workflow. After completing this course, students will know how to turn messy raw data into structured datasets using the R programming language. In addition, the course will help students develop a framework for learning computational methods and for using data to make the government work for everyone and to empower citizens to negotiate power.
Why R? R is free, easy to learn (thanks to the tidyverse), fast (thanks to Rcpp), runs everywhere, open (16,000+ packages, counting only those available on CRAN), and has a large, growing, and inclusive community known as #rstats.
I allow students to audit the course if they contact me by the first week of the course. I do not allow students to enroll in the course after the first week as this course tends to move fast.
- Time: Monday & Wednesday 10:20-12:00 (KST)
- Lecture Room: TBA
- Zoom link: Check out the EKDI site.
Instructor: Dr. Jae Yeon Kim, Assistant Professor, KDI School of Public Policy and Management
- E-mail: jaeyeonkim@kdis.ac.kr
- Teaching assistant: TBD
- Class assistant: Ranci Danis (rdanis@kdis.ac.kr)
Office hours: Friday 14:00-16:00 (KST)
You can set up a 30-minute appointment with me via this Calendly link. Appointments are booked on a rolling basis.
- In person: Professor Kim's Office (S320)
- Via Zoom: The Zoom meeting link will be provided when it becomes available.
There are several uses for office hours. Some examples are listed below.
- You might wonder how computational methods apply to your research or work. In that case, I am happy to talk about computational social science and civic data science applications during office hours.
- You might find the course too challenging or too easy. In that case, I am eager to provide you with additional learning guidelines.
- You can also use this time simply to chat so that we get to know each other.
Please read these slides (by Professor Shana Gadarian in the Department of Political Science in the Maxwell School of Citizenship and Public Affairs, Syracuse University) to understand how to communicate professionally with me, the TA, the CA, and your classmates.
I use the canonical R textbook written by Grolemund and Wickham. Wickham is the mastermind behind the tidyverse, the most popular data analysis framework in the R ecosystem. Almost all of the course materials, including the textbook, are free and available online so that students can access them regardless of their backgrounds.
- Garrett Grolemund and Hadley Wickham (2016). R for Data Science
I use the GitHub course repository in place of readings. All course materials, including lecture notes, code demonstrations, sample data, and assignments, will be posted on this repository. The lecture notes will be provided at least 1-2 days in advance. I expect you to read them before coming to class.
I am currently working on an open-access textbook titled “Computational Thinking for Social Scientists.” The book covers command-line tools, version control, data wrangling, visualization, functional programming, data product development, machine learning, and SQL. If you are interested in learning computational methods further, I recommend reading it. The course is a condensed version of the book's earlier parts.
The software needed for the course is as follows:
- Bash
- Git
- R and RStudio (latest versions)
- Pandoc and LaTeX (for R Markdown)
An installation guide is available in the GitHub repository. To avoid installation and configuration issues during class, I will make all the lecture notes available through MyBinder. Binder makes the code embedded in the lecture notes reproducible by anyone, anywhere. For the assignments and final exam, you should code in R on your own machine.
This is a graded class based on the following:
- Completion of assigned homework (60%)
- Participation (10%)
- Final exam (30%)
Note that 40 points are effectively free, but you must earn the rest.
- A (4.0): 95~
- A- (3.67): 90~94.9
- B+ (3.33): 80~89.9
- B (3.0): 70~79.9 (if you score 100 on every assignment, you are here before the final; if you score 50 on every assignment and earn a perfect score on the final, you are also here)
- B- (2.67): 65~69.9
- C+ (2.33): 60~64.9
- C (2.0): 55~59.9
- C- (1.67): 50~54.9
- F (0): ~44.9
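The weighting above can be checked with a few lines of arithmetic. The sketch below is written in Python purely for illustration (the course itself uses R). The function name `course_score` and the assumption that each component is expressed on a 0-100 scale are illustrative choices, not part of the official grading policy. It reproduces the two scenarios mentioned under the B range, assuming full participation credit.

```python
# Weights in percentage points, taken from the grading breakdown above.
WEIGHTS = {"homework": 60, "participation": 10, "final": 30}


def course_score(homework, participation, final):
    """Return the weighted course score on a 0-100 scale.

    Each argument is assumed (for illustration) to be a percentage
    between 0 and 100 for that component.
    """
    components = {
        "homework": homework,
        "participation": participation,
        "final": final,
    }
    for name, value in components.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    # Weighted sum in points, divided by 100 to return to a 0-100 scale.
    return sum(WEIGHTS[name] * value for name, value in components.items()) / 100


# Scenario 1: perfect assignments, full participation, final not yet taken.
print(course_score(homework=100, participation=100, final=0))  # 70.0

# Scenario 2: 50 on every assignment, full participation, perfect final.
print(course_score(homework=50, participation=100, final=100))  # 70.0
```

Both scenarios land at 70, the lower edge of the B range.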
- I don't tolerate cheating. Plagiarism is cheating. Group discussion is encouraged, but your assignment submission MUST BE YOUR OWN. If your responses are exactly the same as other students' responses, I will assume that all of you were involved in cheating. If you want to claim innocence, please bring me CLEAR and RELEVANT evidence.
- If it happens once, I will take off EVERY POINT related to that section of the assignment or exam. In this case, I will not use the check, check-, check+ grading system for the assignment. Instead, I will use your score.
- If it happens more than once, I will bring the matter to the disciplinary board. See this document for reference.
Students will complete three assignments during the 9 weeks of the course. The assignments provide frequent learning opportunities. Each assignment should be fairly short and should take 4-5 hours to finish. You are encouraged to work in groups, but the work you turn in must be your own. Unless otherwise notified, each assignment should be rendered into an HTML output using R Markdown, and you should submit both the R Markdown file and the HTML output via the EKDIS course website (not the GitHub repository). In addition, the R Markdown files should be reproducible on our end in case we want to reproduce your work. I will cover what R Markdown is and how to create an HTML output in the second week of the course. The final exam uses the same output format. The assignments will be graded on a check-plus, check, check-minus scale.
The class participation portion of the grade can be satisfied in one or more of the following ways:
- attending the lectures (the first class of the week, focusing more on concepts) and sections (the second class of the week, focusing more on hands-on practice)
- asking and answering questions in class
- contributing to class discussion through the Slack workspace: You should ask questions about class material and assignments through the Slack channels, rather than emailing me personally, so that everyone can benefit from the discussion. We encourage you to respond to each other’s questions as well. The CA will send you an invite to the workspace after the roster is confirmed.
The final exam is a take-home exam that I expect you to complete on your own. On a day during the examination period, we will administer the exam, which covers material up to that point in the course. The exam requires applying the skills you’ve acquired throughout the course to a real-world data wrangling challenge. We will provide a link where you can retrieve the final exam and data. You should provide your student ID in the process and complete the exam within 24 hours. As with the assignments, you should turn in both the HTML output and the R Markdown file so that we can reproduce your analysis. This format is very similar to the technical interviews required for data science jobs, so it should be helpful for your career. I will take three things into consideration for the evaluation: reproducibility, efficiency, and readability.
This class is committed to creating an environment in which everyone can participate, regardless of background, discipline, or disability. If you have a particular concern, please come to me as soon as possible so that we can make special arrangements.
Course outline [Lecture notes]
- Why computational social science/civic data science?
  - Reading: lecture notes
  - Extra reading:
    - Kim ch2
    - Rogati, Monica, The AI Hierarchy of Needs, hackernoon.com, June 13 2017
  - Extra videos:
    - An Introduction to Computational Social Science by Matthew J. Salganik
    - Using Data Science for Social Good: Examples, Opportunities, and Challenges by Rayid Ghani
    - Machine Learning and Causal Inference for Policy Evaluation by Susan Athey
    - Democratizing Our Data by Julia Lane
    - Data Action: Using Data for a Public Good by Sarah Williams
    - Citizen Behavioral Science by Nathan Matias
    - Design Justice by Sasha Costanza-Chock
    - You can't do data science in a GUI by Hadley Wickham
- Flexible, efficient, and reproducible data analysis workflow using R
  - Reading: lecture notes
  - Extra reading:
    - Kim ch3
    - Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D'Agostino McGowan, Romain François, Garrett Grolemund et al. "Welcome to the Tidyverse." Journal of Open Source Software 4, no. 43 (2019): 1686.
    - Data Science Project Organization by Carrie Wright, Shannon E. Ellis, Stephanie C. Hicks and Roger D. Peng (2021)
  - Extra videos:
    - Intro to the Tidyverse by Thomas Mock
    - Teach the Tidyverse to Beginners by David Robinson
- Getting the big picture of programming: objects and functions
  - Reading: lecture notes
  - Extra reading: Kim & Ng (2021)
  - Assignment: assignment #1 due
- Playing with data types: vectors, dataframes, and lists
  - Reading: lecture notes
  - Extra reading:
- Understanding the master framework: tidy principles
  - Reading: lecture notes
  - Extra reading:
    - GW ch12
    - Wickham, Hadley. "Tidy data." Journal of Statistical Software 59, no. 1 (2014): 1-23.
    - Broman, Karl W., and Kara H. Woo. "Data organization in spreadsheets." The American Statistician 72, no. 1 (2018): 2-10.
  - Extra video: Tidy Data by Hadley Wickham
- Reshaping and cleaning data using tidyverse
  - Reading: lecture notes
  - Extra reading: Kim ch4
- Summarizing data using tidyverse
  - Reading: lecture notes
  - Extra reading:
    - Kim ch4
    - Wickham, Hadley. "The split-apply-combine strategy for data analysis." Journal of Statistical Software 40, no. 1 (2011): 1-29.
  - Assignment: assignment #2 due
- Visualizing data using tidyverse
  - Reading: lecture notes
  - Extra reading:
    - GW ch3
    - Wickham, Hadley. "A layered grammar of graphics." Journal of Computational and Graphical Statistics 19, no. 1 (2010): 3-28.
  - Extra video:
    - The beauty of data visualization by David McCandless
    - Visualizing Doubt by Amanda Cox
- Simplifying workflow using custom functions
  - Reading: lecture notes
  - Extra reading: Kim ch5
  - Extra video:
    - What Makes a Good Function by Hadley Wickham
  - Assignment: assignment #3 due
- Scaling up workflow using functional programming
  - Reading: lecture notes
  - Extra reading: Kim ch5
  - Extra video:
    - Managing many models with R by Hadley Wickham
    - Repeat Loops by Mark Zuckerberg
- Course content related suggestions: open an issue in the GitHub course repository.
- Lecture/section related questions: use the Slack workspace.
- Logistics related personal requests: contact me via email (only in this case).
In the first two cases, other students may have similar issues, so I would like to solve these shared problems collectively.
This course is a remixed version of PS239T, originally developed by Rochelle Terman (currently Assistant Professor at the University of Chicago) and then revised by Rachel Bernhard (currently Assistant Professor at the University of California-Davis). I taught PS239T three times at Berkeley (as TA for Rachel, as lead instructor, and as co-instructor with Nick Kuipers). Other teaching materials draw from the workshops I created for D-Lab at UC Berkeley, where I was a senior data science fellow, instructor, and consultant. I also thank Dan Hopkins (Professor at the University of Pennsylvania) for sharing his political data analytics course syllabus.
This work is licensed under a Creative Commons Attribution 4.0 International License.