lse-me314.github.io: A Jupyter Notebook repository from qianyaoyy

LSE Methods Summer Programme 2021

London School of Economics and Political Science

Instructors

Kenneth Benoit (K.R.Benoit@lse.ac.uk), Department of Methodology, LSE
Jack Blumenau (j.blumenau@ucl.ac.uk), Department of Political Science, UCL

TAs

Pedro Alves (pedrosequeiraalves@gmail.com), LSE
Sarah Jewett (S.Jewett1@lse.ac.uk), LSE
Markus Kollberg (markus.kollberg.18@ucl.ac.uk), UCL
Julia Leschke (J.Leschke@lse.ac.uk), LSE

This repository contains the class materials for the Research Methods, Data Science, and Mathematics course ME314 Introduction to Data Science and Machine Learning taught in June-July 2021 by Kenneth Benoit and Jack Blumenau.

Quick links to topics

Day	Date	Instructor	Topic
1	Mo 21 Jun	KB	Overview and introduction to data science
2	Tu 22 Jun	KB	The Shape of Data
3	We 23 Jun	KB	Working with Databases
4	Th 24 Jun	KB	Linear Regression
5	Mo 28 Jun	KB	Classification
6	Tu 29 Jun	KB	Non-linear models and tree-based methods
7	We 30 Jun	JB	Resampling methods, model selection and regularization
8	Th 1 Jul	JB	Unsupervised learning and dimensional reduction
9	Fr 2 Jul	JB	Text analysis
10	Mo 5 Jul	JB	Text classification and scaling
11	Tu 6 Jul	JB	Topic modelling
12	We 7 Jul	JB	Data from the Web
13	Fr 9 Jul		Final Exam

Overview

Data science and machine learning are exciting new areas that combine scientific inquiry, statistical knowledge, substantive expertise, and computer programming. One of the main challenges for businesses and policy makers when using big data is to find people with the appropriate skills. Good data science requires experts who combine substantive knowledge with data analytical skills, which makes it a prime area for social scientists with an interest in quantitative methods.

This course integrates prior training in quantitative methods (statistics) and coding with substantive expertise and introduces the fundamental concepts and techniques of data science and machine learning.

Typical students will be advanced undergraduate and postgraduate students from any field requiring the fundamentals of data science or working with typically large datasets and databases. Practitioners from industry, government, or research organisations with some basic training in quantitative analysis or computer programming are also welcome. Because this course surveys diverse techniques and methods, it makes an ideal foundation for more advanced or more specific training. Our applications are drawn from social, political, economic, legal, and business and marketing fields.

Objectives

This course aims to provide an introduction to the quantitative analysis of data using the methods of statistical learning, an approach blending classical statistical methods with recent advances in computational and machine learning. We will cover the main analytical methods from this field with hands-on applications using example datasets, so that students gain experience with and confidence in using the methods we cover. We also cover data preparation and processing, including working with structured databases, key-value formatted data (JSON), and unstructured textual data. At the end of this course students will have a sound understanding of the field of data science, the ability to analyse data using some of its main methods, and a solid foundation for more advanced or more specialised study.

The course will be delivered as a series of morning lectures (held from 10am to 1pm, with an extended break in the middle), followed by lab sessions in the afternoon where students will apply the lessons in a series of instructor-guided exercises using data provided as part of the exercises. The course will cover the following topics:

an overview of data science and the challenge of working with big data using statistical methods
how to integrate the insights from data analytics into knowledge generation and decision-making
how to acquire data, both structured and unstructured, and to process it, store it, and convert it into a format suitable for analysis
approaches to normalising data, using a database manager (SQLite), and working with SQL database queries
the basics of statistical inference, including probability and probability distributions, modelling, and experimental design
an overview of classification methods and related methods for assessing model fit and cross-validating predictive models
supervised learning approaches, including linear and logistic regression, decision trees, and naïve Bayes
unsupervised learning approaches, including clustering, association rules, and principal components analysis
quantitative methods of text analysis, including mining social media and other online resources
data visualisation through a variety of graphs.

Hybrid learning

Given this year's unusual circumstances, all teaching will be delivered such that students may participate either in person or online. See the Moodle site for ME314 for class lists, Zoom links, and announcements.

Lectures: Lectures will be held between 10am and 1pm each day. Students attending remotely will have the chance either to join via Zoom or to watch the recorded lecture once posted to Moodle as they prefer.
Classes:
- Two in-person classes each afternoon from 2pm-3.30pm
- Two online classes, held on Zoom, at 8am-9.30am and 5pm-6.30pm each day

Prerequisites

Students should already be familiar with quantitative methods at an introductory level, up to linear regression analysis. Familiarity with computer programming or database structures is a benefit, but not formally required.

Preparing for the course

You will need R and RStudio for this course. Because of the pandemic, we will require you to use your own computers during this course. You will need to download and install R and RStudio on your computer.

Detailed instructions can also be found here for installing the tools you need and working with the lab materials.

If you are not already familiar with R, we strongly encourage you to become familiar with it before the start of the course. That way, you will spend much less time becoming familiar with the tools, and be able to focus more on the methods. The following links provide a basic introduction to R, which you can study at your own pace before the course begins.

An Introduction to R.
Data Camp R tutorials.
Data Camp R Markdown tutorials, first chapter.
R Codeschool.

We also strongly recommend you spend some time before the course working through the following materials:

Garrett Grolemund and Hadley Wickham (2016) R for Data Science, O'Reilly Media. Note: Online version is available from the authors' page here.
James et al. (2013) An Introduction to Statistical Learning: With applications in R, Springer, Chapters 1-2. Note: The book is available from the authors' page here.

Important Specifics

Computer Software

Computer-based exercises will feature prominently in the course, especially in the lab sessions. The use of all software tools will be explained in the sessions, including how to download and install them. All of the class work will be done using R, using publicly available packages.

Main Texts

The primary texts are:

James et al. (2013) An Introduction to Statistical Learning: With applications in R, Springer. Note: The book is available from the authors' page here.
Garrett Grolemund and Hadley Wickham (2016) R for Data Science, O'Reilly Media. Note: Online version is available from the authors' page here.
Zumel, N. and Mount, J. (2014). Practical Data Science with R. Manning Publications.

The following are supplemental texts which you may also find useful:

Lantz, B. (2013). Machine Learning with R. Packt Publishing.
Lesmeister, C. (2015). Mastering Machine Learning with R. Packt Publishing.
Conway, D. and White, J. (2012) Machine Learning for Hackers. O'Reilly Media.
Leskovec, J., Rajaraman, A. and Ullman, J. (2011). Mining of Massive Datasets. Cambridge University Press.
Zafarani, R., Abbasi, M. A. and Liu, H. (2014) Social Media Mining: An introduction. Cambridge University Press.
Hastie et al. (2009) The Elements of Statistical Learning: Data mining, inference, and prediction. Springer. Note: The book is available from the authors' page here.

Instructors

Kenneth Benoit is Professor of Computational Social Science at the Department of Methodology, LSE. With a background in political science, his substantive work focuses on political party competition, political measurement issues, and electoral systems. His research and teaching are primarily in the field of social science statistical applications. His recent work concerns the quantitative analysis of text as data, for which he has developed the quanteda package for the R statistical software.

Jack Blumenau is an Assistant Professor in Quantitative Methods at the UCL Department of Political Science, and a Data Science Advisor to YouGov. His research is primarily in the fields of legislative and electoral politics.

Assessment

Daily lab exercises

These are not assessed, but will form the practical materials for each day's labs. See https://lse-me314.github.io/instructions for detailed instructions on obtaining and working with each day's lab materials.

Mid-term project

The class assignment for Day 5 serves as the mid-term assignment and counts for 25% of the grade.

Exam

The final exam will be set on Friday 9th July. Details will follow in Week 2.

Detailed Course Schedule

1. Overview and introduction to data science

We will use this session to get to know the range of interests and experience students bring to the class, as well as to survey the machine learning approaches to be covered. We will also discuss and demonstrate the R software.

Resources

Required reading

James et al. (2013), Chapters 1-2. Note: The book is available from the authors' page here.
An Introduction to R.
Downloading and installing RStudio and R on your computer.
Data Camp R tutorials.
Data Camp R Markdown tutorials, first chapter.
R Codeschool.
Garrett Grolemund and Hadley Wickham (2016) R for Data Science, O'Reilly Media, Chapters 1-3. Note: Online version is available from the authors' page here.

2. The shape of data

This session introduces the concept of data "beyond the spreadsheet", the rectangular format most common in statistical datasets. It covers relational structures and the concept of database normalization. We will also cover ways to restructure data from "wide" to "long" format, within strictly rectangular data structures. Additional topics concerning text encoding, date formats, and sparse matrix formats are also covered.

Resources

Required reading

Wickham, Hadley and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol, CA: O'Reilly. Part II Wrangle, Tibbles, Data Import, Tidy Data (Ch. 7-9 of the print edition; Ch. 9-12 online).

If you use Python, this can help your frame of reference

Reshaping data in Python: "Reshaping and Pivot Tables".
Robin Linderborg, "Reshaping Data in Python", 20 Jan 2017.
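
As a rough illustration of the "wide" to "long" restructuring described for this session, here is a minimal sketch in plain Python using only the standard library. The rows, the column names (`unit`, `year`, `value`), and the helper `wide_to_long` are hypothetical and invented for illustration; they do not come from the course materials. In practice, the pandas function `melt` covers the same use case.

```python
# Hypothetical "wide" data: one row per unit, one column per year.
# The numbers are made-up toy values.
wide_rows = [
    {"unit": "A", "2019": 1.0, "2020": 2.0},
    {"unit": "B", "2019": 3.0, "2020": 4.0},
]


def wide_to_long(rows, id_col, var_name, value_name):
    """Return one record per (identifier, variable) pair: the "long" format."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_col:
                continue  # the identifier is copied onto every long row instead
            long_rows.append(
                {id_col: row[id_col], var_name: column, value_name: value}
            )
    return long_rows


long_rows = wide_to_long(wide_rows, id_col="unit", var_name="year", value_name="value")
for record in long_rows:
    print(record)
```

Each wide row with two year columns becomes two long rows, so the two-row table above yields four records. The long form keeps the data strictly rectangular while moving the year from a column name into a value.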

3. Working with databases

We will introduce the concept of database normalization, and how to implement this using good practice in a relational database manager, SQLite. We will cover how to structure data, verify data types, set conditions for data integrity, and perform complex queries to extract data from the database. We will also cover authentication and how to connect to local and remote databases.

Resources

Required reading

Lake, Peter. Concise Guide to Databases: A Practical Introduction. Springer, 2013. Chapters 4-5, Relational Databases and NoSQL databases.
Nield, Thomas. Getting Started with SQL: A hands-on approach for beginners. O’Reilly, 2016. Entire text.
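
The ideas in this session's description (normalised tables, declared data types, conditions for data integrity, and queries that combine tables) can be sketched as follows. This is a minimal example under stated assumptions, not the lab material: the labs use R, and Python's built-in `sqlite3` module is used here only to keep the sketch self-contained. The `party` and `candidate` tables and all their contents are hypothetical.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Normalised design (hypothetical schema): party details live in one table,
# and each candidate row refers to a party by key instead of repeating its name.
conn.executescript("""
CREATE TABLE party (
    party_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE
);
CREATE TABLE candidate (
    candidate_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    votes        INTEGER NOT NULL CHECK (votes >= 0),
    party_id     INTEGER NOT NULL REFERENCES party(party_id)
);
""")

# Made-up toy rows.
conn.executemany(
    "INSERT INTO party (party_id, name) VALUES (?, ?)",
    [(1, "Party A"), (2, "Party B")],
)
conn.executemany(
    "INSERT INTO candidate (name, votes, party_id) VALUES (?, ?, ?)",
    [("Candidate 1", 100, 1), ("Candidate 2", 250, 1), ("Candidate 3", 300, 2)],
)

# A join with aggregation: total votes per party.
totals = conn.execute("""
    SELECT p.name, SUM(c.votes) AS total_votes
    FROM candidate AS c
    JOIN party AS p ON p.party_id = c.party_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(totals)

# The CHECK constraint rejects a row with a negative vote count.
try:
    conn.execute(
        "INSERT INTO candidate (name, votes, party_id) VALUES (?, ?, ?)",
        ("Candidate 4", -5, 1),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)

conn.close()
```

Storing the party name once and referring to it by key is the point of normalisation: renaming a party touches a single row, and the `NOT NULL`, `UNIQUE`, `CHECK`, and `REFERENCES` clauses let the database itself refuse inconsistent data.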

4. Linear regression

Linear regression model and supervised learning.

Resources

Required Reading

James et al., Chapter 3.

5. Classification

Logistic regression, discriminant analysis, Naive Bayes, evaluating model performance.

Resources

Lecture Notes

The mid-term exam will be posted on Moodle.

Solutions to the mid-term will also be posted on Moodle.

Required Reading

James et al., Chapter 4.

6. Non-linear models and tree-based methods

GAMs, local regression, decision trees, random forest, boosting.

Resources

Required Reading

James et al., Chapters 7-8.

7. Resampling methods, model selection and regularization

Cross-validation, bootstrap, ridge and lasso.

Resources

Required Reading

James et al., Chapters 5-6.

8. Unsupervised learning and dimensional reduction

Cluster analysis, PCA.

Resources

Required reading

James et al., Chapter 10.

9. Text analysis

Working with text in R, sentiment analysis, dictionary methods.

Resources

Required reading

Grimmer, J, and B M Stewart (2013), "Text as Data: the Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts." Political Analysis.
Benoit, Kenneth and Alexander Herzog. In press. "Text Analysis: Estimating Policy Preferences From Written and Spoken Words." In Analytics, Policy and Governance, eds. Jennifer Bachner, Kathryn Wagner Hill, and Benjamin Ginsberg.

10. Text classification and scaling

Naive Bayes classifier, Wordscores, and Wordfish.

Resources

Lecture Notes as pdf
Lab 10 materials
Lab 10 solution as RMarkdown or as HTML.

Required reading

Laver, M., Benoit, K., & Garry, J. (2003). Extracting Policy Positions from Political Texts Using Words as Data. American Political Science Review, 97(2), 311-331. doi:10.1017/S0003055403000698

Slapin, J. B. and Proksch, S. (2008), A Scaling Model for Estimating Time‐Series Party Positions from Texts. American Journal of Political Science, 52: 705-722. doi:10.1111/j.1540-5907.2008.00338.x

11. Topic modelling

Latent Dirichlet Allocation, Correlated Topic Model, Structural Topic Model.

Resources

Lecture Notes as pdf
Lab 11 materials
Lab 11 solution as RMarkdown or as HTML.

Required reading

David Blei (2012). "Probabilistic topic models." Communications of the ACM, 55(4): 77-84.
Blei, David, Andrew Y. Ng, and Michael I. Jordan (2003). "Latent dirichlet allocation." Journal of Machine Learning Research 3: 993-1022.
Blei, David (2014) "Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models." Annual Review of Statistics and Its Application, 1: 203-232.

12. Data from the web

The promises and pitfalls of social media data. The Twitter API. The Facebook API. Web scraping. Ethics.

Resources

Lecture Notes as pdf
Lab 12 materials
Lab 12 solution as RMarkdown or as HTML.
