MIT License

GRAD-E1294: Applied Data Science with R

Contains slides and material for course on applied data science with R

Spring 2019

Version: 16 January 2019

General Information

Instructor Information

Matthias Haber is head of data analytics at Looping Group. Previously, he was a research scientist at the Hertie School of Governance with research interests in party politics, electoral behavior, machine learning, survey experiments, and measurement problems. He holds degrees from the University of Mannheim, the University of Essex, and the University of Potsdam.

Course Contents and Learning Objectives

As data are increasingly available online, data analysis has replaced data acquisition as the bottleneck to empirical research in the social sciences. By a common rule of thumb, about 80% of the time in empirical research is spent sourcing, cleaning, and preparing often noisy data, while the remaining 20% goes to the actual analysis. Extracting knowledge from heterogeneous datasets requires not only computational tools but also the programming skills to use them effectively.

This course introduces the computational methods needed for data generation, data manipulation, data visualization, and data reproducibility, and it gives students the ability to apply them to their own projects. The course is organized in three parts. The first part introduces ways to effectively extract, load, transform, and visualize structured and unstructured data. The second and third parts focus on practical applications of data science methods in academic research and in industry, respectively.

There is increasing demand inside and outside academia for the skills to analyze data effectively and to present results to a range of audiences, which makes this course equally relevant for students seeking scientific or business careers.

The goals of the course are to learn ways to efficiently import, explore and communicate messy data from various sources and to get an overview of current industry data science solutions.

Grading and Assignments

The course requires the completion of small weekly homework exercises and a final data project. Political science thrives on collaboration and co-authorship. Hence, participants are allowed (but not required) to complete their homework exercises and final projects in two-person teams. The data project is due in the final exam week.

1. Weekly Assignments

Each week, students complete small homework exercises that let them directly apply the techniques they learned in class. Each homework exercise contributes 6% to the final grade, and students are encouraged to complete them in pairs.

2. Final project

For the final data project, students are given a large dataset to analyze. They present their results using their own ideas and the skills learned throughout the course.

Composition of the Final Grade

| Name | Percent of Final Mark | Due |
|---|---|---|
| Weekly Assignments | 60% | Tuesdays, 8 am |
| Final Data Project | 40% | May 17, 8 am |

Late submission of assignments

For each day an assignment is turned in late, the grade will be reduced by 10% (e.g., a submission two days after the deadline results in a 20% grade deduction).

Attendance

Students are expected to be present and prepared for every class session. Active participation during lectures and seminar discussions is essential. If unavoidable circumstances arise which prevent attendance or preparation, the instructor should be advised by email with as much advance notice as possible. Please note that students cannot miss more than two sessions. For further information please consult the Examination Rules §9.

Academic integrity

The Hertie School of Governance is committed to the standards of good academic and ethical conduct. Any violation of these standards shall be subject to disciplinary action. Plagiarism, deceitful actions as well as free-riding in group work are not tolerated. See Examination Rules §15.

General Readings

The required readings for the course are:

Session Overview

| Session | Session Date | Session Title |
|---|---|---|
| 1 | 06.02.2019 | Introduction to Data Science |
| 2 | 13.02.2019 | Data Importation |
| 3 | 20.02.2019 | Data Cleaning |
| 4 | 27.02.2019 | Data Transformation |
| 5 | 06.03.2019 | Working with Relational Data |
| 6 | 13.03.2019 | Working with Strings |
| | | Mid-term Exam Week |
| 7 | 27.03.2019 | Web Scraping |
| 8 | 03.04.2019 | Supervised Machine Learning |
| 9 | 10.04.2019 | Unsupervised Machine Learning |
| 10 | 17.04.2019 | Data Communication |
| 11 | 24.04.2019 | Working with Big Data |
| 12 | 08.05.2019 | Guest Lecture |

Course Sessions and Readings

All readings will be accessible on the Moodle course site before the semester starts. If the readings change, students will be notified by email.

Required readings are to be read and analysed thoroughly. Optional readings are intended to broaden your knowledge in the respective area, and it is highly recommended that you at least skim them.

Session 1: 06.02.2019 Introduction to Data Science
Learning Objectives: Learn about the course structure and data science fundamentals.
Required Readings:
- Harrison, E. 2015. RStudio and GitHub. R-bloggers.com
- Interactive introduction to Git from the Code School
Optional Readings: none

Session 2: 13.02.2019 Data Importation
Learning Objectives: Learn how to read different file formats into R.
Required Readings:
- Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 11.
Optional Readings: none

Session 3: 20.02.2019 Data Cleaning
Learning Objectives: Learn how to organize data consistently with tidyr.
Required Readings:
- Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 12.
- Wickham, Hadley. 2014. “Tidy Data”. Journal of Statistical Software 59 (10).
Optional Readings: none

Session 4: 27.02.2019 Data Transformation
Learning Objectives: Learn how to transform data with dplyr.
Required Readings:
- Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 5.
Optional Readings: none

Session 5: 06.03.2019 Working with Relational Data
Learning Objectives: Learn how to work with multiple tables of data.
Required Readings:
- Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 13.
Optional Readings: none

Session 6: 13.03.2019 Working with Strings
Learning Objectives: Learn how to effectively manipulate strings with stringr.
Required Readings:
- Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 14.
- Wickham, Hadley. 2010. “stringr: modern, consistent string processing”. The R Journal 2 (2): 38-40.
Optional Readings:
- Bacon, Greg. Regular Expressions. Stackoverflow.com

Mid-term Exam Week: 18-22 March 2019 – no class

Session 7: 27.03.2019 Web Scraping
Learning Objectives: Learn how to automatically collect data off the web and interact with APIs.
Required Readings:
- Munzert, S., C. Rubba, P. Meißner and D. Nyhuis. 2015. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. Chapter 9.
- Law, J. and J. Rosenblum. 2015. rvest tutorial: scraping the web using R.
Optional Readings: none

Session 8: 03.04.2019 Supervised Machine Learning
Learning Objectives: Learn how to train an algorithm with labelled input data.
Required Readings:
- James, Gareth, et al. 2013. An introduction to statistical learning. Vol. 7. New York: Springer. Chapter 4.
- Kuhn, Max, and Kjell Johnson. 2013. Applied predictive modeling. New York: Springer. Chapters 2 and 10.
- Kuhn, Max. 2018. A Short Introduction to the caret Package.
Optional Readings: none

Session 9: 10.04.2019 Unsupervised Machine Learning
Learning Objectives: Learn how to detect structure in unlabelled input data.
Required Readings:
- James, Gareth, et al. 2013. An introduction to statistical learning. Vol. 7. New York: Springer. Chapter 10.
- Kuhn, Max, and Kjell Johnson. 2013. Applied predictive modeling. New York: Springer. Chapter 2.
Optional Readings: none

Session 10: 17.04.2019 Data Communication
Learning Objectives: Learn how to (dynamically) communicate your data to others.
Required Readings:
- Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 28.
- Gelman, Andrew and Antony Unwin. 2012. “Infovis and Statistical Graphics: Different Goals, Different Looks.” Journal of Computational and Graphical Statistics 22(1): 2-28.
Additional Readings:
- Plotly
- Shiny

Session 11: 24.04.2019 Working with Big Data
Learning Objectives: Learn how to scale things up.
Required Readings:
- D. Schmidt, W.-C. Chen, M. A. Matheson, and G. Ostrouchov. 2016. Programming with BIG Data in R: Scaling Analytics from One to Thousands of Nodes. Big Data Research 2016.
Optional Readings:
- Plotly
- Shiny

Session 12: 08.05.2019 Guest Lecture
Learning Objectives: Insights into data science from an industry professional.
Required Readings:
- DataCamp Blog. 2017. The Periodic Table of Data Science. R-bloggers.com
Optional Readings: none