Summary
The web is full of data that are of great interest to scientists and businesses alike. Firms, public institutions, and private users provide every imaginable type of information, and new channels of communication generate vast amounts of data on human behavior. But how can we efficiently collect data from the Internet; retrieve information from social networks, search engines, and dynamic web pages; tap web services; and, finally, process the collected data with statistical software? In this workshop we will learn the basics of web data collection with R. The sessions are hands-on: we will practice every step of the process with R using various examples. We will learn how to scrape content from static and dynamic web pages, connect to APIs of popular web services such as Twitter to read out and process user data, and set up scrapers that run automatically.
Event
Social Science Data Lab, MZES Mannheim
Date and Venue
Friday, October 14, MZES A Building, Room A-231
Instructor
Simon Munzert (website, Twitter)
Requirements
This course assumes prior experience using R. Please bring a laptop with the latest versions of R and RStudio installed (see below for the technical setup).
Time schedule
| Slot | Time | Topic |
|---|---|---|
| Slot 1 | 09.00 a.m. - 10.15 a.m. | Introduction, setup, and overview |
| Slot 2 | 10.30 a.m. - 12.30 p.m. | Scraping static webpages with rvest |
| Slot 3 | 02.00 p.m. - 03.15 p.m. | Scraping dynamic webpages with RSelenium; good practice |
| Slot 4 | 03.30 p.m. - 05.00 p.m. | Tapping APIs |
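As a taste of what Slot 2 covers, here is a minimal rvest sketch that parses a page and extracts its first-level heading; the URL and the "h1" selector are illustrative choices, not part of the course materials:

```r
# Minimal rvest sketch (URL and CSS selector are illustrative assumptions)
library(rvest)
page <- read_html("https://en.wikipedia.org/wiki/R_(programming_language)")
html_text(html_nodes(page, "h1"))
```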
Together with Christian Rubba, Peter Meissner, and Dominic Nyhuis, I've written a book on *Automated Data Collection with R*. Participants might find it useful as further reading after the course. As documentation of the course work, however, this code archive and the accompanying reader are fully sufficient.
Please make sure that the current version of R is installed. If not, update from here: https://cran.r-project.org/
Obviously, feel free to choose the coding environment you are most comfortable with. I'll use RStudio in the course. You might want to use it, too: https://www.rstudio.com/products/rstudio/download/
We are going to need a couple of packages from CRAN. You can install them all by executing the following code chunk:

```r
# packages needed in the course
p_needed <- c("plyr", "dplyr", "stringr", "lubridate", "jsonlite",
              "httr", "xml2", "rvest", "devtools", "ggmap",
              "networkD3", "RSelenium", "pageviews", "aRxiv",
              "twitteR", "streamR")
# packages already installed on this machine
packages <- rownames(installed.packages())
# install only the packages that are still missing
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0) {
  install.packages(p_to_install)
}
```
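Once everything is installed, you may also want to load all packages in one go. A small convenience sketch:

```r
# Load all course packages at once; require() returns FALSE for any that failed
sapply(p_needed, require, character.only = TRUE)
```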
Finally, if you want to follow the code on Twitter mining live in the course, please consult the instructions to connect with Twitter as described here (first section, "Connecting with Twitter"): Connecting with Twitter using R
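For reference, authentication with the twitteR package typically looks like the following sketch; the four credential values are placeholders for the keys you obtain when registering your app, as described in the linked instructions:

```r
# Hedged sketch of twitteR authentication; replace the placeholder strings
# with the credentials from your registered Twitter app
library(twitteR)
setup_twitter_oauth(consumer_key    = "YOUR_CONSUMER_KEY",
                    consumer_secret = "YOUR_CONSUMER_SECRET",
                    access_token    = "YOUR_ACCESS_TOKEN",
                    access_secret   = "YOUR_ACCESS_SECRET")
tweets <- searchTwitter("#rstats", n = 10)  # small test query
```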
A way to solve the encoding issue on Windows machines in the breweries example, as suggested by Stas Kolenikov:
Replace this line

```r
locations <- str_extract(breweries, "[[:digit:]].+?–")
```

with the following:

```r
locations <- str_extract(gsub(intToUtf8(0x2013), "-", breweries), "[[:digit:]].+?-")
```

Here, `intToUtf8(0x2013)` generates the en dash "–"; replacing it with a plain hyphen before matching makes the pattern work regardless of how the locale handles the character.
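To see the fix in action, here is a self-contained check with a hypothetical input string in the style of the breweries data:

```r
# Hypothetical input; "\u2013" is the en dash that trips up some Windows locales
library(stringr)
x <- "1. Munich \u2013 Bavaria"
str_extract(gsub(intToUtf8(0x2013), "-", x), "[[:digit:]].+?-")
#> [1] "1. Munich -"
```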
| Area | URL | Short description |
|---|---|---|
| Web technologies, general | http://www.w3.org/ | Home of the World Wide Web Consortium (W3C); also provides access to standards and drafts of web technologies |
| | http://w3schools.com | Great tutorial playground to learn web technologies interactively |
| | https://w3techs.com/technologies | Overview of all kinds of web technologies |
| XML and XPath | http://selectorgadget.com/ | Probably the most useful tool for generating CSS selectors and XPath expressions with a simple point-and-click approach |
| | http://www.xmlvalidation.com/ | Online XML validator |
| | http://www.rssboard.org/ | Information about the Really Simple Syndication standard |
| CSS selectors | http://www.w3schools.com/cssref/css_selectors.asp | W3Schools CSS selector reference |
| | http://flukeout.github.io/ | Interactive CSS selectors tutorial |
| JSON | http://www.json.org/ | Home of the JSON data interchange standard |
| | http://jsonformatter.curiousconcept.com | Formatting tool for JSON content |
| HTTP | http://httpbin.org | HTTP request and response service; useful to debug HTTP queries (see the httr sketch below the table) |
| | http://useragentstring.com | Tool to figure out what's behind a User-Agent string |
| | http://curl.haxx.se/libcurl/ | Documentation of the libcurl library |
| | http://www.robotstxt.org/ | Information about robots.txt |
| OAuth | http://oauth.net | Information about the OAuth authorization standard |
| | http://hueniverse.com/oauth | Great overview of OAuth 1.0 |
| Database technologies | http://db-engines.com | Compendium of existing database management systems |
| | https://www.thoughtworks.com/insights/blog/nosql-databases-overview | Introduction to NoSQL databases |
| Regular expressions | http://www.pcre.org/ | Description of Perl Compatible Regular Expressions |
| | https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html | Regular expressions as used in base R |
| | http://regexone.com/ | Online regex tutorial |
| | http://regex101.com | Regex testing environment |
| | http://www.regexplanet.com/ | Another regex testing environment |
| | http://regex.inginf.units.it/ | App that learns optimal regular expressions from training data |
| | http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 | The truth about HTML parsing with regular expressions |
| | https://www.youtube.com/watch?v=Cv2DpwSCgRw | Yes, there's a regex song |
| Selenium | http://docs.seleniumhq.org | Selenium documentation |
| APIs | http://www.programmableweb.com/apis | Overview of many existing web APIs |
| | http://ropensci.org/ | Platform for R packages that provide access to science data repositories |
| R | http://cran.r-project.org/web/views/WebTechnologies.html | CRAN Task View on Web Technologies and Services; useful to stay in the loop about what's possible with R |
| | http://tryr.codeschool.com/ | An excellent interactive primer for learning R |
| | http://www.r-bloggers.com/ | Blog aggregator that collects entries from many R-related blogs |
| | http://planetr.stderr.org | Blog aggregator providing information about new R packages and scientific work related to R |
| | http://dirk.eddelbuettel.com/cranberries/ | Dirk Eddelbuettel's CRANberries blog keeps you up to date on new and updated R packages |
| | http://www.omegahat.org/ | Home of the "Omega Project for Statistical Computing"; documentation of many important R packages dealing with web-based data |
| | https://github.com/ropensci/user2016-tutorial#extracting-data-from-the-web-apis-and-beyond | Web API tutorial from the useR 2016 conference by Scott Chamberlain, Karthik Ram, and Garrett Grolemund |
| General web scraping | http://r-datacollection.com | Probably the most useful resource of all |
| | http://www.stata-datacollection.com | Now let's see if that works… |
| Legal issues | http://www.eff.org/ | Electronic Frontier Foundation, a non-profit organisation that advocates digital rights |
| | http://blawgsearch.justia.com/ | Search engine for law blogs; useful if you want to stay informed about recent rulings on digital issues |
| | http://en.wikipedia.org/wiki/Web_scraping | See the section on "Legal issues" |
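Since several of the resources above deal with HTTP, here is a minimal httr sketch against httpbin.org, the request-and-response service listed in the HTTP row; the user agent string is an arbitrary example:

```r
# Minimal httr sketch: httpbin.org echoes the request back as JSON
library(httr)
resp <- GET("http://httpbin.org/get", user_agent("web-scraping-workshop example"))
status_code(resp)             # 200 if the request succeeded
content(resp, as = "parsed")  # inspect the echoed request details
```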