/dsr

Introduction to Data Science with R (Sciences Po, Paris, 2023)

> Introduction to Data Science with R

François Briatte
Spring 2023. Work in progress.

An introduction to data science with R, RStudio, and the {tidyverse} packages, aimed at social scientists with little to no training in statistical computing and related topics.

> Syllabus
> Readings (handbooks, videos, tutorials and more)

This folder contains the code, data and documentation of the examples used either during the practice sessions in class, or distributed as homework exercises. Slides and exercise solutions are not included.

Outline

  1. Software
  2. Workflow
  3. Data
  4. Visualization
  5. Description
  6. Association
  7. Correlation
  8. Regression
  9. Nonlinearity
  10. Surveys
  11. Classification
  12. Extensions

Bonus sections:

Part 1. Basics

Software setup, first steps with coding, handling data, and plotting things.

1. Software

  • RStudio interface
    • The panes layout
    • Executing code from the Console
    • Executing code from a script: Ctrl-Enter
    • Setting preferences
  • R syntax
    • Comments (#) and code
    • Functions and arguments
    • Objects and assignment: <-
    • Package installation
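
The syntax elements listed above can be illustrated with a few lines of base R. This is only a minimal sketch: the object names (`heights`, `avg`) are made up for illustration and do not come from the course scripts.

```r
# Comments start with a hash sign and are ignored by R
heights <- c(1.62, 1.75, 1.80)  # assignment: store three values in an object

# Functions take arguments, some of which are named
avg <- mean(heights, na.rm = TRUE)
round(avg, digits = 2)

# Packages are installed once, then loaded in every new session
# install.packages("tidyverse")  # uncomment to install
```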

> Readings
> Exercise 1: Generative art

2. Workflow

  • More on the RStudio interface
    • Setting the working directory
    • Doing so by using RStudio project files: .Rproj
    • The Files and Plots panes
    • Executing code down to a given line: Ctrl-Alt-B
  • More R syntax essentials
    • Code spanning multiple lines, and pipes: %>%, |>
    • R objects and types
    • Data frames, variables and values
    • R has many packages and sub-syntaxes: base, {tidyverse}, {ggplot2}, etc.
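
As a rough illustration of the points above, the sketch below builds a small data frame and computes the same value twice, once with nested calls and once with the native pipe. The data are invented for the example, and the native pipe `|>` assumes R 4.1 or later.

```r
# A data frame holds variables (columns) and values (rows)
df <- data.frame(country = c("FR", "DE", "IT"), score = c(3.5, 4.1, 2.8))
str(df)  # inspect the object: types of each variable

# Nested calls are read inside-out
nested <- round(mean(df$score), 1)

# The same computation with the native pipe is read left to right,
# and can span multiple lines
piped <- df$score |>
  mean() |>
  round(1)

identical(nested, piped)
```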

> Readings
> Demo 1: Cholera deaths in London, 1854 (John Snow)
> Demo 2: Industrial disputes and left-wing seat shares (CWS 2020)
> Exercise 2: Weird R syntax

3. Data

Data wrangling, mostly with the {dplyr} package.

  • Data I/O
    • reading/writing datasets with {readr}, {haven} and {readxl}
    • inspecting datasets: head, str, view, glimpse
    • passing mentions -- strings, factors, dates and special formats
    • passing mentions -- SQL and databases, and data engineering
  • Data manipulation on a single dataset
    • selecting variables: $, select and $`special cases`
    • sorting (ordering): arrange
    • subsetting: filter
    • aggregating and summarising values: group_by + summarise
    • reshaping: pivot_longer, pivot_wider
  • Data manipulation on multiple datasets
    • joining (merging) two datasets: full_join and the like
    • binding multiple datasets: bind_rows
  • Recoding and transforming values: mutate
    • 'if/else' recodes: if_else
    • type coercion/conversion: as.numeric, as.integer etc.
    • handling missing values: is.na, na_if, drop_na
    • handling text with {stringr} and regular expressions, a.k.a. regex
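
Several of the verbs listed above can be chained into a single pipeline. The sketch below uses the built-in `mtcars` dataset rather than any of the course datasets, and the variable name `gearbox` is an arbitrary choice for the example.

```r
library(dplyr)

cars_summary <- mtcars |>
  # keep three variables
  select(mpg, cyl, am) |>
  # drop rows with missing values, if any
  filter(!is.na(mpg)) |>
  # 'if/else' recode of a 0/1 variable into labels
  mutate(gearbox = if_else(am == 1, "manual", "automatic")) |>
  # aggregate by group
  group_by(cyl, gearbox) |>
  summarise(n = n(), mean_mpg = mean(mpg), .groups = "drop") |>
  # sort, highest average first
  arrange(desc(mean_mpg))

cars_summary
```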

> Readings
> Demo 1: Covid-19 and global income inequality (Deaton)
> Demo 2: Visualizing the 'EU mood' (Guinaudeau and Schnatterer)
> Exercise 3: Satisfaction with democracy in Hungary and Poland (Eurobarometer)

4. Visualization

Plots, mostly with the {ggplot2} package.

  • Principles of data abstraction
  • Plotting engines
  • The ‘grammar of graphics’ approach
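
To give an idea of what the 'grammar of graphics' looks like in practice, here is a minimal {ggplot2} sketch in which a plot is assembled from data, aesthetic mappings and layers. It uses the built-in `mtcars` dataset, not the data from the demos.

```r
library(ggplot2)

# Data + aesthetic mappings, then one layer per geometry
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Weight (1,000 lbs)", y = "Miles per gallon", colour = "Cylinders")

p  # printing the object draws the plot
```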

> Readings
> Demo: Economic growth and public debt (Reinhart and Rogoff)
> Bonus 1: Mapping life expectancy worldwide
> Bonus 2: Anscombe's quartet
> Exercise 4: Life expectancy and GDP per capita (Preston curve)

Part 2. Essentials

Descriptive and inferential statistics, the frequentist way (no time for Bayesian ones, I'm afraid). This section will briefly mention some more advanced topics related to regression models, statistical estimation and machine learning.

5. Description

Summary statistics and distributions. Also covering sampling, and possibly bootstrap resampling if time permits (which of course won't happen).

  • Describing a distribution
    • Central tendency
    • Dispersion
    • Quantiles
    • Proportions
  • Inference
    • The ‘normal’ distribution
    • The Central Limit Theorem (CLT) and the Law of Large Numbers (LLN)
    • Standard errors
    • Confidence intervals
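
The quantities listed above can all be computed with base R. The sketch below is one possible illustration on the built-in `mtcars` data; the 95% confidence interval is built by hand from the standard error and the t distribution.

```r
x <- mtcars$mpg
n <- length(x)

# Central tendency and dispersion
c(mean = mean(x), median = median(x))
c(sd = sd(x), iqr = IQR(x))

# Quantiles (here, quartiles)
quantile(x, probs = c(0.25, 0.5, 0.75))

# Standard error of the mean, and a 95% confidence interval
se <- sd(x) / sqrt(n)
ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se
ci
```

The interval matches the one reported by `t.test(x)`, which uses the same formula.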

> Readings
> Demo: Colonialism, democracy, life expectancy and wealth, Part 1
> Exercise 5: Trust in Islamist parties (graded homework)

6. Association

Statistical tests to compare means and proportions.

  • Association tests
  • Statistical significance
  • Comparisons of means
  • Comparisons of proportions
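
As a hedged illustration of the tests above, the sketch below compares means with a t-test and tests the association between two categorical variables with a chi-squared test. The choice of variables from `mtcars` is arbitrary and not taken from the course demo.

```r
# Comparison of means: Welch two-sample t-test of mpg by gearbox type
tt <- t.test(mpg ~ am, data = mtcars)
tt$p.value

# Association between two categorical variables: chi-squared test
# on a 2 x 2 contingency table
tab <- table(gearbox = mtcars$am, engine = mtcars$vs)
ct <- chisq.test(tab)
ct$p.value
```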

> Readings
> Demo: Colonialism, democracy, life expectancy and wealth, Part 2
> No exercise this week -- catch up on all previously distributed material

7. Correlation

Linear and nonlinear, as an introduction to linear and nonlinear models, with some basic philosophy of statistical science.

  • Correlation, the actual thing
  • Linearity and nonlinearity
  • Data-generating processes and stylized facts
  • Fitting functions to joint distributions
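
One way to make the linear/nonlinear contrast concrete is to compute correlation coefficients and then fit a straight line and a curve to the same pair of variables. This is only a sketch on `mtcars`; the quadratic fit is an arbitrary choice of nonlinear function.

```r
# Linear (Pearson) and rank-based (Spearman) correlation
r_pearson  <- cor(mtcars$wt, mtcars$mpg)
r_spearman <- cor(mtcars$wt, mtcars$mpg, method = "spearman")
c(pearson = r_pearson, spearman = r_spearman)

# Fitting a straight line, then a quadratic curve, to the same data
fit_linear    <- lm(mpg ~ wt, data = mtcars)
fit_quadratic <- lm(mpg ~ poly(wt, 2), data = mtcars)

# Share of variance explained by each fit
r2 <- c(
  linear    = summary(fit_linear)$r.squared,
  quadratic = summary(fit_quadratic)$r.squared
)
r2
```

With a single predictor, the R-squared of the linear fit equals the squared Pearson correlation.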

> Readings
> Demo: Social democratic capitalism (Kenworthy)
> Exercise 7: US Republican vote shares and life expectancy (Case and Deaton)

8. Regression

Linear regression, the full package: least squares, dummies, interactions, diagnostics, marginal effects. All in one session, if things go well, but this usually takes half of any introductory statistics course.

  • Estimation: fitting linear models via Ordinary Least Squares (OLS)
    • Modelling your ‘response’ (dependent variable)
    • Interpreting your coefficients
    • Categorical predictors (independent variables): handling ‘dummies’
    • Interaction terms: ‘multiplying’ your predictors
  • Postestimation: what to do after fitting a linear model
    • Goodness-of-fit
    • Diagnostics: residuals, multicollinearity and heteroscedasticity
    • Additional diagnostics: outliers and ‘influential observations’
    • Marginal effects
  • Model manipulation with the {broom} package
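
The estimation and postestimation steps above might look as follows for a model with a categorical predictor and an interaction term. This is a sketch on `mtcars`, not the model from the demo, and the {broom} call only runs if that package is installed.

```r
# Estimation: OLS with a dummy (am) and an interaction (wt x am)
m <- lm(mpg ~ wt * factor(am), data = mtcars)
summary(m)$coefficients

# Goodness-of-fit
r2 <- summary(m)$r.squared

# Diagnostics: residuals against fitted values
res <- residuals(m)
plot(fitted(m), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Model manipulation: coefficients as a tidy data frame
if (requireNamespace("broom", quietly = TRUE)) {
  print(broom::tidy(m, conf.int = TRUE))
}
```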

> Readings
> Demo: U.S. presidential election outcomes and income growth (Bartels)
> Exercise 8: Growth forecasts and fiscal consolidation (IMF/Giles)

9. Nonlinearity

Focusing exclusively on logistic regression, with no time to say more about other generalized models.

  • Generalized linear models
  • The logit ‘link’ function
  • Log-odds and odds ratios
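
A minimal logistic regression sketch, again on `mtcars` rather than on the survey data used in the demo, shows how log-odds, odds ratios and predicted probabilities relate to each other. The value `wt = 3` is an arbitrary example.

```r
# Logistic regression: binary response, logit link
m <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

coef(m)       # coefficients are on the log-odds scale
exp(coef(m))  # exponentiated coefficients are odds ratios

# Predicted probability for a car weighing 3,000 lbs
p <- predict(m, newdata = data.frame(wt = 3), type = "response")

# Same value by hand: inverse logit of the linear predictor
p_manual <- plogis(sum(coef(m) * c(1, 3)))
```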

> Readings
> Demo: Opposition to abortion in Canada (CES 2021)
> Exercise 9: Predicting Covid-19 lockdowns (graded homework)

10. Surveys

Surveys, and how to handle survey weights, with the {survey} and {srvyr} packages. Not yet online, work in progress.

> Readings
> Demo: ..
> Exercise 10: Economic insecurity and religious reassurance (ESS)

Part 3. Extras

Statistical learning and machine learning could go here, as well as APIs and Web scraping, networks, big data and more things like JavaScript visualization libraries, but there are only two extra sessions.

11. Classification

Dimensionality reduction, principal components, clustering and partitioning, using {factoextra} and related packages to visualise the results.
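
The sketch below shows principal components and k-means clustering with base R only, on the built-in `USArrests` data; it does not use {factoextra}, and the choice of three clusters is arbitrary.

```r
# Standardize the variables before PCA and clustering
X <- scale(USArrests)

# Principal components
pca <- prcomp(X)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 2)

# Partitioning: k-means with 3 clusters (arbitrary choice)
set.seed(1)  # k-means starts from random centers
km <- kmeans(X, centers = 3, nstart = 25)
table(km$cluster)

# Observations on the first two components, coloured by cluster
plot(pca$x[, 1:2], col = km$cluster, pch = 19)
```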

> Readings
> Demo 1: Protein consumption in European countries, 1973
> Demo 2: Feelings towards politicians in France (CNEP 2017)
> No exercise this week -- catch up on all previously distributed material

12. Extensions

Students expressed an interest in maps and text, so let's cover these before closing with mentions of other useful things.

  • Maps with {sf}
    • Spatial visualization and coordinate reference systems
    • Centroids and interpolation
    • Going further: passing mention of {sfdep}
  • Text analysis with {tidytext}
    • Tokenization and stopwords
    • Sentiment analysis
    • n-grams
    • Going further: passing mentions of {quanteda}, {topicmodels}, and {stm}
  • Going further with R
  • Other advanced topics
    • APIs and Web scraping
    • Python and machine learning
    • Bayesian models
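
The tokenization, stopword and n-gram steps listed above can be sketched in base R, as a rough stand-in for what {tidytext} does on real corpora. The sentence and the stopword list below are made up for the example.

```r
# Made-up example sentence
text <- "The climate crisis is the crisis of our time"

# Tokenization: lowercase, then split on anything that is not a letter
tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
tokens <- tokens[tokens != ""]

# Stopword removal, with a tiny hand-made list
stopwords <- c("the", "is", "of", "our")
content <- tokens[!tokens %in% stopwords]
table(content)

# Bigrams (n-grams with n = 2): each token pasted to the next one
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
bigrams
```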

> Readings
> Demo 1: Mapping support for fossil fuel taxation (ESS)
> Demo 2: Mining into Greta Thunberg's speeches
> Exercise 12: data science skills


Dependencies

The course runs on R 4.x and depends on the following packages:

install.packages("remotes")

# required for multiple sessions
pkgs <- c("broom", "countrycode", "e1071", "ggmosaic", "ggeffects", "ggrepel", 
          "moments", "performance", "sf", "texreg", "tidyverse", "WDI")
remotes::install_cran(pkgs)

# required for Session 11 only
s11 <- c("car", "corrr", "factoextra", "ggcorrplot", "ggfortify", "plotly")
remotes::install_cran(s11)

# required for Session 12 only
s12_maps <- c("gstat", "stars")
s12_text <- c("igraph", "ggraph", "pdftools", "tidytext")
remotes::install_cran(c(s12_maps, s12_text))

# optional (used to prepare the course datasets)
xtra <- c("rvest")
remotes::install_cran(xtra)

Credits

The last time I had a chance to build such a course was ten years ago, with Ivaylo D. Petev. Some of the inspiration for this course dates back to that time.

In the meantime, I have taught a few other quantitative methods courses, including some tutorials and guest lectures for Jan Rovny's own courses. Some of the material for this course comes from those other ones.

Some thanks go to Kim Antunez, who will soon be teaching her own version of this course, and who suggested some of the readings that made it to my own list.

Some thanks also go to Joël Gombin and Timothée Gidoin, who inspired and helped with a first draft of this course, six years before it actually ran for the first time.

Last, this course and all the other ones mentioned above took place at Sciences Po in Paris, France, where some more inspiration has come from Emiliano Grossman and many others.

The ASCII art in some scripts is by Patrick Gillespie.

Elsewhere

Most of this course is available on GitHub, where a wiki page lists other similar courses. I would love it if the present course were as good as those listed there, but cannot guarantee it.