François Briatte
Spring 2023. Work in progress.
An introduction to data science with R, RStudio, and the `{tidyverse}` packages, aimed at social scientists with little to no training in statistical computing and related topics.
> Syllabus
>
> Readings (handbooks, videos, tutorials and more)
This folder contains the code, data and documentation of the examples used either during the practice sessions in class, or distributed as homework exercises. Slides and exercise solutions are not included.
- Software
- Workflow
- Data
- Visualization
- Description
- Association
- Correlation
- Regression
- Nonlinearity
- Surveys
- Classification
- Extensions
Bonus sections:
Software setup, first steps with coding, handling data, and plotting things.
- RStudio interface
  - The panes layout
  - Executing code from the Console
  - Executing code from a script: `Ctrl-Enter`
  - Setting preferences
- R syntax
  - Comments (`#`) and code
  - Functions and arguments
  - Objects and assignment: `<-`
  - Package installation
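As a rough illustration of the syntax points listed above (comments, functions and arguments, assignment, package installation), here is a minimal sketch. It is not taken from the course scripts, and the object name `x` is made up for the example.

```r
# Lines starting with a hash sign are comments: R ignores them.

# Functions take arguments, written between parentheses.
sqrt(16)
round(3.14159, digits = 2)

# Objects are created with the assignment operator `<-`.
x <- c(2, 4, 6)
mean(x)

# Packages are installed once, then loaded in every new R session.
# install.packages("tidyverse")  # uncomment on first use
# library(tidyverse)
```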
> Readings
>
> Exercise 1: Generative art
- More on the RStudio interface
  - Setting the working directory
  - Doing so by using RStudio project files: `.Rproj`
  - The Files and Plots panes
  - Executing code down to a given line: `Ctrl-Alt-B`
- More R syntax essentials
  - Code spanning multiple lines, and pipes: `%>%`, `|>`
  - R objects and types
  - Data frames, variables and values
  - R has many packages and sub-syntaxes: base, `{tidyverse}`, `{ggplot2}`, etc.
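To make the points on pipes and data frames more concrete, here is a small sketch using a made-up data frame (the figures are illustrative, not course data). It assumes R 4.1 or later for the native pipe `|>`; the `%>%` pipe behaves similarly once `{magrittr}` or the `{tidyverse}` is loaded.

```r
# A small, made-up data frame: columns are variables, rows are observations.
df <- data.frame(
  country      = c("France", "Germany", "Italy"),
  pop_millions = c(68, 84, 59)
)

# Inspect the types of the variables.
str(df)

# Nested function calls read from the inside out...
nested <- round(mean(df$pop_millions), 1)

# ... whereas piped code reads from left to right, and can span multiple lines.
piped <- df$pop_millions |>
  mean() |>
  round(1)

# Both approaches give the same result.
identical(nested, piped)
```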
> Readings
>
> Demo 1: Cholera deaths in London, 1854 (John Snow)
>
> Demo 2: Industrial disputes and left-wing seat shares (CWS 2020)
>
> Exercise 2: Weird R syntax
Data wrangling, mostly with the `{dplyr}` package.
- Data I/O
  - reading/writing datasets with `{readr}`, `{haven}` and `{readxl}`
  - inspecting datasets: `head`, `str`, `view`, `glimpse`
  - passing mentions -- strings, factors, dates and special formats
  - passing mentions -- SQL and databases, and data engineering
- Data manipulation on a single dataset
  - selecting variables: `$`, `select` and `` $`special cases` ``
  - sorting (ordering): `arrange`
  - subsetting: `filter`
  - aggregating and summarising values: `group_by` + `summarise`
  - reshaping: `pivot_longer`, `pivot_wider`
- Data manipulation on multiple datasets
  - joining (merging) two datasets: `full_join` and the like
  - binding multiple datasets: `bind_rows`
- Recoding and transforming values: `mutate`
  - 'if/else' recodes: `if_else`
  - type coercion/conversion: `as.numeric`, `as.integer` etc.
  - handling missing values: `is.na`, `na_if`, `drop_na`
  - handling text with `{stringr}` and regular expressions, a.k.a. regex
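The verbs listed above chain together naturally. The sketch below strings several of them into one pipeline, using a tiny made-up dataset (the values and the names `d`, `shares`, `wide` and `labels` are hypothetical, and only loosely inspired by the exercise topic). It assumes that `{dplyr}` and `{tidyr}` are installed.

```r
library(dplyr)
library(tidyr)

# Made-up respondent-level data: one row per respondent.
d <- tibble(
  country   = c("HU", "HU", "PL", "PL", "PL"),
  year      = c(2019, 2020, 2019, 2020, 2020),
  satisfied = c("1", "0", "1", NA, "1")
)

shares <- d %>%
  # type conversion: character to integer
  mutate(satisfied = as.integer(satisfied)) %>%
  # drop rows with missing values on that variable
  drop_na(satisfied) %>%
  # aggregate: share of satisfied respondents per country-year
  group_by(country, year) %>%
  summarise(share = mean(satisfied), .groups = "drop") %>%
  # 'if/else' recode
  mutate(level = if_else(share >= 0.5, "high", "low")) %>%
  # sort
  arrange(country, year)

# Reshape from long to wide: one column per year.
wide <- shares %>%
  select(country, year, share) %>%
  pivot_wider(names_from = year, values_from = share, names_prefix = "y")

# Join (merge) with a second dataset holding country names.
labels <- tibble(country = c("HU", "PL"), name = c("Hungary", "Poland"))
wide <- full_join(wide, labels, by = "country")
wide
```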
> Readings
>
> Demo 1: Covid-19 and global income inequality (Deaton)
>
> Demo 2: Visualizing the 'EU mood' (Guinaudeau and Schnatterer)
>
> Exercise 3: Satisfaction with democracy in Hungary and Poland (Eurobarometer)
Plots, mostly with the `{ggplot2}` package.
- Principles of data abstraction
- Plotting engines
- The ‘grammar of graphics’ approach
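The 'grammar of graphics' idea is perhaps easiest to grasp from a short example: a plot is built by mapping variables to aesthetics, and then adding layers. The sketch below uses the `mtcars` dataset that ships with R as a stand-in for the course data, and assumes that `{ggplot2}` is installed.

```r
library(ggplot2)

# Data + aesthetic mappings + layers (geoms) + labels.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  # one point per car, coloured by number of cylinders
  geom_point(aes(colour = factor(cyl))) +
  # a linear fit, drawn on top of the points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    x = "Weight (1,000 lbs)",
    y = "Miles per gallon",
    colour = "Cylinders"
  )

# Printing the object draws the plot.
p
```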
> Readings
>
> Demo: Economic growth and public debt (Reinhart and Rogoff)
>
> Bonus 1: Mapping life expectancy worldwide
>
> Bonus 2: Anscombe's quartet
>
> Exercise 4: Life expectancy and GDP per capita (Preston curve)
Descriptive and inferential statistics, the frequentist way (no time for Bayesian ones, I'm afraid). This section will briefly mention some more advanced topics related to regression models, statistical estimation and machine learning.
Summary statistics and distributions. Also covering sampling, and possibly bootstrap resampling if time permits (which of course won't happen).
- Describing a distribution
  - Central tendency
  - Dispersion
  - Quantiles
  - Proportions
- Inference
  - The ‘normal’ distribution
  - The Central Limit Theorem (CLT) and the Law of Large Numbers (LLN)
  - Standard errors
  - Confidence intervals
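To show how standard errors and confidence intervals follow from the descriptive statistics listed above, here is a minimal base R sketch on simulated data. The sample is made up, and the 95% level is just the conventional choice.

```r
set.seed(1)

# A simulated sample of 100 values.
x <- rnorm(100, mean = 50, sd = 10)
n <- length(x)

# Describing the distribution.
mean(x)                        # central tendency
sd(x)                          # dispersion
quantile(x, c(.25, .5, .75))   # quantiles

# Standard error of the mean.
se <- sd(x) / sqrt(n)

# 95% confidence interval, using the t distribution.
ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se
ci

# The same interval, as computed by a one-sample t-test.
t.test(x)$conf.int
```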
> Readings
>
> Demo: Colonialism, democracy, life expectancy and wealth, Part 1
>
> Exercise 5: Trust in Islamist parties (graded homework)
Statistical tests to compare means and proportions.
- Association tests
- Statistical significance
- Comparisons of means
- Comparisons of proportions
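The tests mentioned above all exist in base R. The sketch below compares a mean and a proportion across two groups, using the built-in `mtcars` dataset as a stand-in for the course data; the choice of variables is arbitrary.

```r
# Comparison of means: fuel consumption by transmission type (Welch t-test).
tt <- t.test(mpg ~ am, data = mtcars)
tt$estimate  # group means
tt$p.value   # statistical significance

# Comparison of proportions: engine shape (vs) by transmission type (am).
tab <- table(am = mtcars$am, vs = mtcars$vs)
tab

# Test of equal proportions across the two rows of the table...
pt <- prop.test(tab)
pt$p.value

# ... and the closely related Chi-squared test of association.
ct <- chisq.test(tab)
ct$p.value
```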
> Readings
>
> Demo: Colonialism, democracy, life expectancy and wealth, Part 2
>
> No exercise this week -- catch up on all previously distributed material
Correlation, linear and nonlinear, as an introduction to linear and nonlinear models, with some basic philosophy of statistical science.
- Correlation, the actual thing
- Linearity and nonlinearity
- Data-generating processes and stylized facts
- Fitting functions to joint distributions
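One way to see why linearity matters is to simulate a data-generating process where the relationship is strong, yet invisible to the linear correlation coefficient. The sketch below does so with made-up data; the quadratic form and the noise level are arbitrary assumptions.

```r
set.seed(42)

# A made-up data-generating process: y depends on x, but not linearly.
x <- seq(-3, 3, length.out = 200)
y <- x^2 + rnorm(200, sd = 0.5)

# Pearson's correlation is close to zero, despite the strong relationship.
r <- cor(x, y)
r

# Fitting a straight line, then a quadratic function, to the joint distribution.
fit_linear    <- lm(y ~ x)
fit_quadratic <- lm(y ~ poly(x, 2))

# The quadratic fit explains far more of the variance in y.
r2_linear    <- summary(fit_linear)$r.squared
r2_quadratic <- summary(fit_quadratic)$r.squared
c(linear = r2_linear, quadratic = r2_quadratic)
```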
> Readings
>
> Demo: Social democratic capitalism (Kenworthy)
>
> Exercise 7: US Republican vote shares and life expectancy (Case and Deaton)
Linear regression, the full package: least squares, dummies, interactions, diagnostics, marginal effects. All in one session, if things go well, but this usually takes half of any introductory statistics course.
- Estimation: fitting linear models via Ordinary Least Squares (OLS)
  - Modelling your ‘response’ (dependent variable)
  - Interpreting your coefficients
  - Categorical predictors (independent variables): handling ‘dummies’
  - Interaction terms: ‘multiplying’ your predictors
- Postestimation: what to do after fitting a linear model
  - Goodness-of-fit
  - Diagnostics: residuals, multicollinearity and heteroscedasticity
  - Additional diagnostics: outliers and ‘influential observations’
  - Marginal effects
  - Model manipulation with the `{broom}` package
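The estimation and postestimation steps listed above might look as follows in practice. This is only a sketch: it uses the built-in `mtcars` dataset rather than the course data, the model specification is arbitrary, and it assumes that `{broom}` is installed.

```r
library(broom)

# OLS with a continuous predictor, a dummy, and their interaction.
fit <- lm(mpg ~ wt * factor(am), data = mtcars)

# Coefficients, with confidence intervals, as a data frame.
coefs <- tidy(fit, conf.int = TRUE)
coefs

# Goodness-of-fit statistics (R-squared, AIC, etc.), one row per model.
gof <- glance(fit)
gof$r.squared

# Observation-level diagnostics: fitted values, residuals, Cook's distance.
diag <- augment(fit)
head(diag[, c(".fitted", ".resid", ".cooksd")])
```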
> Readings
>
> Demo: U.S. presidential election outcomes and income growth (Bartels)
>
> Exercise 8: Growth forecasts and fiscal consolidation (IMF/Giles)
Focusing almost exclusively on logistic regression, with the hope of introducing more fun stuff, but with no time to say more about other generalized models.
- Generalized linear models
- The logit ‘link’ function
- Log-odds and odds ratios
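The link between log-odds and odds ratios is easier to see with a fitted model at hand. The sketch below fits a logistic regression to the built-in `mtcars` dataset (a stand-in for the course data), predicting manual transmission from car weight; the choice of variables is purely illustrative.

```r
# Logistic regression: a GLM with a binomial family and a logit link.
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# Coefficients are expressed in log-odds...
log_odds <- coef(fit)
log_odds

# ... and exponentiating them gives odds ratios.
odds_ratios <- exp(log_odds)
odds_ratios

# Predicted probability for a (hypothetical) car weighing 3,000 lbs.
p_hat <- predict(fit, newdata = data.frame(wt = 3), type = "response")
p_hat
```

An odds ratio below 1 for `wt` means that the odds of a manual transmission decrease as weight increases.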
> Readings
>
> Demo: Opposition to abortion in Canada (CES 2021)
>
> Exercise 9: Predicting Covid-19 lockdowns (graded homework)
Surveys, and how to handle survey weights, with the `{survey}` and `{srvyr}` packages. Not yet online, work in progress.
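Since this session is not online yet, the following is only a guess at what a first step could look like: declaring survey weights with `{survey}`, then computing a weighted mean. The data, the variable names `trust` and `w`, and the simple design without clusters or strata are all assumptions made for the example.

```r
library(survey)

# Made-up survey data: a binary response and a weight per respondent.
d <- data.frame(
  trust = c(1, 0, 1, 1, 0),
  w     = c(0.5, 2, 1, 1.5, 1)
)

# Declare the survey design: no clusters (ids = ~1), weights stored in `w`.
des <- svydesign(ids = ~1, weights = ~w, data = d)

# Weighted mean of the response, with its standard error.
est <- svymean(~trust, des)
est

# The point estimate matches a plain weighted mean.
weighted.mean(d$trust, d$w)
```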
> Readings
>
> Demo: ..
>
> Exercise 10: Economic insecurity and religious reassurance (ESS)
Statistical learning and machine learning could go here, as well as APIs and Web scraping, networks, big data and more things like JavaScript visualization libraries, but there are only two extra sessions.
Dimensionality reduction, principal components, clustering and partitioning, using `{factoextra}` and related packages to visualise the results.
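Both techniques can be sketched in base R before turning to `{factoextra}` for the plots. The example below uses the built-in `USArrests` dataset as a stand-in for the course data, and the choice of three clusters is arbitrary.

```r
# Principal components analysis on standardized variables.
pca <- prcomp(USArrests, scale. = TRUE)

# Share of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 2)

# k-means partitioning into 3 clusters, on the same standardized data.
set.seed(123)
km <- kmeans(scale(USArrests), centers = 3, nstart = 25)
table(km$cluster)

# Optional, if {factoextra} is installed: visualise the results.
# factoextra::fviz_pca_biplot(pca)
# factoextra::fviz_cluster(km, data = scale(USArrests))
```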
> Readings
>
> Demo 1: Protein consumption in European countries, 1973
>
> Demo 2: Feelings towards politicians in France (CNEP 2017)
>
> No exercise this week -- catch up on all previously distributed material
Students expressed an interest in maps and text, so let's cover these, before closing on mentions of other useful things.
- Maps with `{sf}`
  - Spatial visualization and coordinate reference systems
  - Centroids and interpolation
  - Going further: passing mention of `{sfdep}`
- Text analysis with `{tidytext}`
  - Tokenization and stopwords
  - Sentiment analysis
  - n-grams
  - Going further: passing mentions of `{quanteda}`, `{topicmodels}`, and `{stm}`
- Going further with R
  - Version control with Git/GitHub
  - Dynamic documents with R Markdown and Quarto
- Other advanced topics
  - APIs and Web scraping
  - Python and machine learning
  - Bayesian models
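For the text analysis part, tokenization, stopword removal and n-grams can be sketched in a few lines. The two sentences below are made up for the example (they are not from the course corpus), and the code assumes that `{dplyr}` and `{tidytext}` are installed.

```r
library(dplyr)
library(tidytext)

# Two made-up 'documents'.
docs <- tibble(
  id   = 1:2,
  text = c(
    "The climate is changing fast.",
    "We need to act on climate change now."
  )
)

# Tokenization: one row per word, lowercased, punctuation removed.
# Stopwords are then dropped, and the remaining words counted.
words <- docs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
words

# n-grams: one row per sequence of two consecutive words (bigrams).
bigrams <- docs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
bigrams
```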
> Readings
>
> Demo 1: Mapping support for fossil fuel taxation (ESS)
>
> Demo 2: Mining into Greta Thunberg's speeches
>
> Exercise 12: data science skills
The course runs on R 4.x and depends on the following packages:
```r
# used below to install all other packages from CRAN
install.packages("remotes")

# required for multiple sessions
pkgs <- c("broom", "countrycode", "e1071", "ggmosaic", "ggeffects", "ggrepel",
          "moments", "performance", "sf", "texreg", "tidyverse", "WDI")
remotes::install_cran(pkgs)

# required for Session 11 only
s11 <- c("car", "corrr", "factoextra", "ggcorrplot", "ggfortify", "plotly")
remotes::install_cran(s11)

# required for Session 12 only
s12_maps <- c("gstat", "stars")
s12_text <- c("igraph", "ggraph", "pdftools", "tidytext")
remotes::install_cran(c(s12_maps, s12_text))

# optional (used to prepare the course datasets)
xtra <- c("rvest")
remotes::install_cran(xtra)
```
The last time I had a chance to build such a course was ten years ago, with Ivaylo D. Petev. Some of the inspiration for this course dates back to that time.
In the meantime, I have taught a few other quantitative methods courses, including some tutorials and guest lectures for Jan Rovny's own courses. Some of the material for this course comes from those other ones.
Some thanks go to Kim Antunez, who will soon be teaching her own version of this course, and who suggested some of the readings that made it onto my own list.
Some thanks also go to Joël Gombin and Timothée Gidoin, who inspired and helped with a first draft of this course, six years before it actually ran for the first time.
Last, this course and all the other ones mentioned above took place at Sciences Po in Paris, France, where some more inspiration has come from Emiliano Grossman and many others.
The ASCII art in some scripts is by Patrick Gillespie.
Most of this course is available on GitHub, where a wiki page lists other similar courses. I would love it if the present course were as good as those listed there, but cannot guarantee it.