Contains slides and material for a course on applied data science with R.
Version: 16 January 2019
-
Instructor: Matthias Haber
-
Email: haber@hertie-school.org
Instructor Information
Matthias Haber is working as head of data analytics at Looping Group. Previously, he was a research scientist at the Hertie School of Governance with research interests in party politics, electoral behavior, machine learning, survey experiments, and measurement problems. He holds degrees from the University of Mannheim, the University of Essex, and the University of Potsdam.
As data are increasingly available online, data analysis has replaced data acquisition as the bottleneck to empirical research in the social sciences. A common rule of thumb holds that roughly 80% of the time in an empirical project is spent sourcing, cleaning and preparing often noisy data, while the remaining 20% goes to the actual analysis. Extracting knowledge from heterogeneous datasets requires not only computational tools, but also the programming skills to use them effectively.
This course introduces the computational methods needed for data generation, data manipulation, data visualization, and data reproducibility, and gives students the ability to apply them to their own projects. The course is organized in three parts. The first part introduces ways to effectively extract, load, transform, and visualize structured and unstructured data. The second and third parts focus on practical applications of data science methods in academic research and in industry, respectively.
There is an increasing demand inside and outside of academia for the skills to analyze data effectively and to present results to a range of audiences, which makes this course equally relevant for students seeking scientific or business careers.
The goals of the course are to learn ways to efficiently import, explore and communicate messy data from various sources and to get an overview of current industry data science solutions.
The course requires the completion of small, weekly homework exercises and a final data project. Political science thrives on collaboration and co-authorship. Hence, participants are allowed (but not required) to complete their homework exercises and their final projects in two-person teams. The data project is due in the final exam week.
1. Weekly Assignments
Each week, students complete small homework exercises that allow them to directly apply the techniques they learned in class. Each homework exercise contributes 6% to the final grade, and students are encouraged to complete them in pairs.
2. Final project
For the final data project, students are given a large dataset, which they analyze using their own ideas and the skills learned throughout the course, and then present their results.
Composition of the Final Grade
Name | Percent of Final Mark | Due |
---|---|---|
Weekly Assignments | 60% | Tuesdays, 8 am |
Final Data Project | 40% | May 17, 8 am |
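As a rough illustration of how the two components combine, the weights above can be written as a single formula. This is a sketch under stated assumptions: the symbols $a_i$, $p$ and $G$ are introduced here for illustration only, all scores are assumed to be on a common 0 to 100 scale, and the count of ten assignments is inferred from 60% divided by 6% per assignment rather than stated explicitly.

```latex
% G   : final grade (assumed 0-100 scale)
% a_i : score on weekly assignment i, i = 1..10 (count inferred from 60% / 6%)
% p   : score on the final data project
\[
  G = 0.06 \sum_{i=1}^{10} a_i + 0.40\, p
\]
% Worked example: a_i = 80 for every assignment, p = 90
\[
  G = 0.06 \cdot 800 + 0.40 \cdot 90 = 48 + 36 = 84
\]
```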
Late submission of assignments
For each day an assignment is turned in late, its grade will be reduced by 10% (e.g. a submission two days after the deadline results in a 20% grade deduction).
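One possible reading of this rule, given as a sketch rather than the official calculation: the deduction grows linearly with the number of days late, as the two-day example suggests. The symbols $s$, $d$ and $s_{\text{late}}$ are introduced here for illustration. Whether the deduction is taken as a share of the earned score (as assumed below) or as percentage points of the maximum score, and whether the result is floored at zero, are assumptions not settled by the text above.

```latex
% s      : score the assignment would have earned if submitted on time
% d      : number of days late (d = 0 for an on-time submission)
% s_late : score after the late deduction (assumed proportional, floored at 0)
\[
  s_{\text{late}} = s \cdot \max\bigl(0,\; 1 - 0.10\, d\bigr)
\]
% Worked example: s = 90, d = 2
\[
  s_{\text{late}} = 90 \cdot (1 - 0.20) = 72
\]
```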
Attendance
Students are expected to be present and prepared for every class session. Active participation during lectures and seminar discussions is essential. If unavoidable circumstances arise which prevent attendance or preparation, the instructor should be advised by email with as much advance notice as possible. Please note that students cannot miss more than two sessions. For further information please consult the Examination Rules §9.
Academic integrity
The Hertie School of Governance is committed to the standards of good academic and ethical conduct. Any violation of these standards shall be subject to disciplinary action. Plagiarism, deceitful actions as well as free-riding in group work are not tolerated. See Examination Rules §15.
The required readings for the course are:
- Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly.
- Healy, Kieran. 2018. Data Visualization: A Practical Introduction. Princeton University Press.
Session | Session Date | Session Title |
---|---|---|
1 | 06.02.2019 | Introduction to Data Science |
2 | 13.02.2019 | Data Importation |
3 | 20.02.2019 | Data Cleaning |
4 | 27.02.2019 | Data Transformation |
5 | 06.03.2019 | Working with Relational Data |
6 | 13.03.2019 | Working with Strings |
- | 18.03.2019 to 22.03.2019 | Mid-term Exam Week (no class) |
7 | 27.03.2019 | Web Scraping |
8 | 03.04.2019 | Supervised Machine Learning |
9 | 10.04.2019 | Unsupervised Machine Learning |
10 | 17.04.2019 | Data Communication |
11 | 24.04.2019 | Working with Big Data |
12 | 08.05.2019 | Guest Lecture |
All readings will be accessible on the Moodle course site before the semester starts. If the readings change, students will be notified by email.
Required readings are to be read and analysed thoroughly. Optional readings are intended to broaden your knowledge in the respective area, and it is highly recommended that you at least skim them.
Session 1: 06.02.2019 | Introduction to Data Science |
---|---|
Learning Objectives | Learn about the course structure and data science fundamentals |
Required Readings | - Harrison, E. 2015. RStudio and GitHub. R-bloggers.com |
| | - Interactive introduction to Git from the Code School |
Optional Readings |
Session 2: 13.02.2019 | Data Importation |
---|---|
Learning Objectives | Learn how to read different file formats into R. |
Required Readings | - Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 11. |
Optional Readings |
Session 3: 20.02.2019 | Data Cleaning |
---|---|
Learning Objectives | Learn how to organize data consistently with tidyr |
Required Readings | - Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 12. |
| | - Wickham, Hadley. 2014. “Tidy Data”. Journal of Statistical Software 59 (10). |
Optional Readings |
Session 4: 27.02.2019 | Data Transformation |
---|---|
Learning Objectives | Learn how to transform data with dplyr |
Required Readings | - Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 5. |
Optional Readings |
Session 5: 06.03.2019 | Working with Relational Data |
---|---|
Learning Objectives | Learn how to work with multiple tables of data |
Required Readings | - Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 13. |
Optional Readings |
Session 6: 13.03.2019 | Working with Strings |
---|---|
Learning Objectives | Learn how to effectively manipulate strings with stringr |
Required Readings | - Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 14. |
| | - Wickham, Hadley. 2010. “stringr: modern, consistent string processing”. The R Journal 2 (2): 38-40. |
Optional Readings | - Bacon, Greg. Regular Expressions. Stackoverflow.com |
Mid-term Exam Week: 18-22 March 2019 – no class
Session 7: 27.03.2019 | Web Scraping |
---|---|
Learning Objectives | Learn how to automatically collect data off the web and interact with APIs |
Required Readings | - Munzert, S., C. Rubba, P. Meißner and D. Nyhuis. 2015. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. Chapter 9. |
| | - Law, J. and J. Rosenblum. 2015. rvest tutorial: scraping the web using R. |
Optional Readings |
Session 8: 03.04.2019 | Supervised Machine Learning |
---|---|
Learning Objectives | Learn how to train an algorithm with labelled input data |
Required Readings | - James, Gareth, et al. 2013. An Introduction to Statistical Learning. Vol. 7. New York: Springer. Chapter 4. |
| | - Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. New York: Springer. Chapters 2 and 10. |
| | - Kuhn, Max. 2018. A Short Introduction to the caret Package. |
Optional Readings |
Session 9: 10.04.2019 | Unsupervised Machine Learning |
---|---|
Learning Objectives | Learn how to detect structure in unlabelled input data |
Required Readings | - James, Gareth, et al. 2013. An Introduction to Statistical Learning. Vol. 7. New York: Springer. Chapter 10. |
| | - Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. New York: Springer. Chapter 2. |
Optional Readings |
Session 10: 17.04.2019 | Data Communication |
---|---|
Learning Objectives | Learn how to (dynamically) communicate your data to others |
Required Readings | - Wickham, H. and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Chapter 28. |
| | - Gelman, Andrew and Antony Unwin. 2012. “Infovis and Statistical Graphics: Different Goals, Different Looks.” Journal of Computational and Graphical Statistics 22(1): 2-28. |
Optional Readings | - Plotly |
| | - Shiny |
Session 11: 24.04.2019 | Working with Big Data |
---|---|
Learning Objectives | Learn how to scale things up |
Required Readings | - D. Schmidt, W.-C. Chen, M. A. Matheson, and G. Ostrouchov. 2016. Programming with BIG Data in R: Scaling Analytics from One to Thousands of Nodes. Big Data Research 2016. |
Optional Readings | - Plotly |
| | - Shiny |
Session 12: 08.05.2019 | Guest Lecture |
---|---|
Learning Objectives | Insights into data science from an industry professional |
Required Readings | - DataCamp Blog. 2017. The Periodic Table of Data Science. R-bloggers.com |
Optional Readings |