GRAD-E1294: Applied Data Science with R

This repository contains the slides and materials for the course on applied data science with R.

Spring 2019

Version: 16 January 2019

General Information

Instructor: Matthias Haber
Email: haber@hertie-school.org
Work: https://github.com/mhaber

Instructor Information

Matthias Haber is working as head of data analytics at Looping Group. Previously, he was a research scientist at the Hertie School of Governance with research interests in party politics, electoral behavior, machine learning, survey experiments, and measurement problems. He holds degrees from the University of Mannheim, the University of Essex, and the University of Potsdam.

Course Contents and Learning Objectives

As data become increasingly available online, data analysis has replaced data acquisition as the bottleneck in empirical social science research. By a common rule of thumb, about 80% of the time in an empirical project goes to sourcing, cleaning, and preparing often noisy data, and only the remaining 20% to the analysis itself. Extracting knowledge from heterogeneous datasets requires not only computational tools but also the programming skills to use them effectively.

This course introduces the computational methods needed for data generation, data manipulation, data visualization, and data reproducibility, and it enables students to apply them to their own projects. The course is organized in three parts. The first part introduces ways to effectively extract, load, transform, and visualize structured and unstructured data. The second and third parts focus on practical applications of data science methods in academic research and in industry, respectively.

Demand is increasing, both inside and outside academia, for the skills to analyze data effectively and to present results to a range of audiences, which makes this course equally relevant for students seeking scientific or business careers.

The goals of the course are to learn ways to efficiently import, explore and communicate messy data from various sources and to get an overview of current industry data science solutions.

Grading and Assignments

The course requires the completion of small, weekly homework exercises and a final data project. Political science thrives on collaboration and co-authorship. Hence, participants are allowed (but not required) to complete their homework exercises and their final projects in two-person teams. The data project is due in the final exam week.

1. Weekly Assignments

Each week, students complete small homework exercises that allow them to directly apply the techniques they learned in class. Each homework exercise contributes 6% to the final grade, and students are encouraged to complete the exercises in pairs.

2. Final project

For the final data project, students are given a large dataset, which they analyze using their own ideas and the skills learned throughout the course, and then present their results.

Composition of the Final Grade

Name	Percent of Final Mark	Due
Weekly Assignments	60%	Tuesdays, 8 am
Final Data Project	40%	May 17, 8 am
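
The weighting in the table above can be sketched as a short calculation. This is only an illustration, not an official grading tool. It assumes marks on a 0-100 scale and ten weekly assignments (implied by 6% each adding up to 60%). The function name `final_mark` is hypothetical.

```python
def final_mark(homework_marks, project_mark):
    """Combine marks (0-100 scale assumed) using the syllabus weights."""
    homework_weight_each = 0.06  # 6% per weekly assignment
    project_weight = 0.40        # 40% for the final data project

    # Assumption: ten assignments, since 10 x 6% = 60% of the final mark.
    if len(homework_marks) != 10:
        raise ValueError("expected ten weekly assignments (10 x 6% = 60%)")

    homework_part = sum(homework_weight_each * m for m in homework_marks)
    return homework_part + project_weight * project_mark


# Example: 80 on every homework exercise and 90 on the final project.
print(round(final_mark([80] * 10, 90), 2))
```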

Late submission of assignments

For each day an assignment is turned in late, its grade will be reduced by 10% (e.g. a submission two days after the deadline would result in a 20% grade deduction).
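
The late-submission rule can be read as a simple formula. The sketch below is one possible interpretation, not the official procedure. It assumes the deduction is a share of the mark earned (rather than percentage points), that lateness is counted in whole days, and that the mark cannot drop below zero. The function name `apply_late_penalty` is hypothetical.

```python
def apply_late_penalty(mark, days_late):
    """Reduce a mark by 10% per day late (interpretation assumed, see above)."""
    penalty_per_day = 0.10  # 10% per day, as stated in the syllabus

    if days_late < 0:
        raise ValueError("days_late must not be negative")

    # Assumption: the remaining share of the mark is floored at zero.
    remaining_share = max(0.0, 1.0 - penalty_per_day * days_late)
    return mark * remaining_share


# Example from the text: two days late means a 20% deduction.
print(round(apply_late_penalty(80, 2), 2))
```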

Attendance

Students are expected to be present and prepared for every class session. Active participation during lectures and seminar discussions is essential. If unavoidable circumstances arise which prevent attendance or preparation, the instructor should be advised by email with as much advance notice as possible. Please note that students cannot miss more than two sessions. For further information please consult the Examination Rules §9.

Academic integrity

The Hertie School of Governance is committed to the standards of good academic and ethical conduct. Any violation of these standards shall be subject to disciplinary action. Plagiarism, deceitful actions as well as free-riding in group work are not tolerated. See Examination Rules §15.

General Readings

The required readings for the course are:

Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly.
Healy, Kieran. 2018. Data Visualization: A Practical Introduction. Princeton University Press.

Session Overview

Session	Session Date	Session Title
1	06.02.2019	Introduction to Data Science
2	13.02.2019	Data Importation
3	20.02.2019	Data Cleaning
4	27.02.2019	Data Transformation
5	06.03.2019	Working with Relational Data
6	13.03.2019	Working with Strings
-	18.03.-22.03.2019	Mid-term Exam Week (no class)
7	27.03.2019	Web Scraping
8	03.04.2019	Supervised Machine Learning
9	10.04.2019	Unsupervised Machine Learning
10	17.04.2019	Data Communication
11	24.04.2019	Working with Big Data
12	08.05.2019	Guest Lecture

Course Sessions and Readings

All readings will be accessible on the Moodle course site before the semester starts. If the readings change, students will be notified by email.

Required readings are to be read and analysed thoroughly. Optional readings are intended to broaden your knowledge of the respective area, and skimming them, at the very least, is highly recommended.

Session 1: 06.02.2019	Introduction to Data Science
Learning Objectives	Learn about the course structure and data science fundamentals
Required Readings	- Harrison, E. 2015. RStudio and GitHub. R-bloggers.com
	- Interactive introduction to Git from Code School
Optional Readings

Session 2: 13.02.2019	Data Importation
Learning Objectives	Learn how to read different file formats into R.
Required Readings	- Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 11.
Optional Readings

Session 3: 20.02.2019	Data Cleaning
Learning Objectives	Learn how to organize data consistently with tidyr
Required Readings	- Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 12.
	- Wickham, Hadley. 2014. “Tidy Data”. Journal of Statistical Software 59 (10).
Optional Readings

Session 4: 27.02.2019	Data Transformation
Learning Objectives	Learn how to transform data with dplyr
Required Readings	- Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 5.
Optional Readings

Session 5: 06.03.2019	Working with Relational Data
Learning Objectives	Learn how to work with multiple tables of data
Required Readings	- Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 13.
Optional Readings

Session 6: 13.03.2019	Working with Strings
Learning Objectives	Learn how to effectively manipulate strings with stringr
Required Readings	- Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 14.
	- Wickham, Hadley. 2010. “stringr: modern, consistent string processing”. The R Journal 2 (2): 38-40.
Optional Readings	- Bacon, Greg. Regular Expressions. Stackoverflow.com

Mid-term Exam Week: 18-22 March 2019 – no class

Session 7: 27.03.2019	Web Scraping
Learning Objectives	Learn how to automatically collect data off the web and interact with APIs
Required Readings	- Munzert, S., C. Rubba, P. Meißner and D. Nyhuis. 2015. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. Chapter 9.
	- Law, J. and J. Rosenblum. 2015. rvest tutorial: scraping the web using R.
Optional Readings

Session 8: 03.04.2019	Supervised Machine Learning
Learning Objectives	Learn how to train an algorithm with labelled input data
Required Readings	- James, Gareth, et al. 2013. An Introduction to Statistical Learning. Vol. 7. New York: Springer. Chapter 4.
	- Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. New York: Springer. Chapters 2 and 10.
	- Kuhn, Max. 2018. A Short Introduction to the caret Package.
Optional Readings

Session 9: 10.04.2019	Unsupervised Machine Learning
Learning Objectives	Learn how to detect structure in unlabelled input data
Required Readings	- James, Gareth, et al. 2013. An Introduction to Statistical Learning. Vol. 7. New York: Springer. Chapter 10.
	- Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. New York: Springer. Chapter 2.
Optional Readings

Session 10: 17.04.2019	Data Communication
Learning Objectives	Learn how to (dynamically) communicate your data to others
Required Readings	- Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 28.
	- Gelman, Andrew and Antony Unwin. 2012. “Infovis and Statistical Graphics: Different Goals, Different Looks.” Journal of Computational and Graphical Statistics 22(1): 2-28.
Optional Readings	- Plotly
	- Shiny

Session 11: 24.04.2019	Working with Big Data
Learning Objectives	Learn how to scale things up
Required Readings	- Schmidt, D., W.-C. Chen, M. A. Matheson and G. Ostrouchov. 2016. “Programming with BIG Data in R: Scaling Analytics from One to Thousands of Nodes”. Big Data Research.
Optional Readings	- Plotly
	- Shiny

Session 12: 08.05.2019	Guest Lecture
Learning Objectives	Insights into data science from an industry professional
Required Readings	- DataCamp Blog. 2017. The Periodic Table of Data Science. R-bloggers.com
Optional Readings
