This tutorial covers the basics of web scraping with R. We'll begin by scraping static pages, then move on to techniques for scraping dynamic websites that use JavaScript to render their content.
For a detailed explanation, see this blog post.
For macOS, run the following:
brew install r
brew install --cask r-studio
For Windows, run the following:
choco install r.project
choco install r.studio
install.packages("rvest")
install.packages("dplyr")
library(rvest)
library(dplyr)
link = "https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes"
page = read_html(link)
page %>% html_elements(css="")
page %>% html_elements(xpath="")
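For example, the following two calls are equivalent ways of selecting every table on the page, the first using a CSS selector and the second the corresponding XPath expression:
page %>% html_elements(css = "table")
page %>% html_elements(xpath = "//table")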
For the above page, use the following:
htmlElement <- page %>% html_element("table.sortable")
df <- html_table(htmlElement, header = FALSE)
# The second row of the table contains the column names.
names(df) <- df[2,]
# Drop the first two rows, which hold header text rather than data.
df <- df[-1:-2,]
write.csv(df, "iso_codes.csv")
# `url` holds the address of the page containing the image.
page <- read_html(url)
image_element <- page %>% html_element(".thumbborder")
image_url <- image_element %>% html_attr("src")
# The src may be protocol-relative ("//..."); resolve it against the page URL.
image_url <- xml2::url_absolute(image_url, url)
download.file(image_url, destfile = "paris.jpg", mode = "wb")
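If you want every thumbnail on the page rather than a single image, the same pattern extends to html_elements(). A minimal sketch, assuming each src attribute resolves to a downloadable file:
image_urls <- page %>%
  html_elements(".thumbborder") %>%
  html_attr("src") %>%
  xml2::url_absolute(url)
# Name each file after the last component of its URL.
for (u in image_urls) {
  download.file(u, destfile = basename(u), mode = "wb")
}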
Find the API endpoint and use it as follows:
library(httr)
# `api_url` is the endpoint discovered in the previous step.
page <- read_html(GET(api_url, timeout(10)))
jsontext <- page %>% html_element("p") %>% html_text()
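The result is still a JSON string. To work with it as an R object, parse it, for example with the jsonlite package (a sketch, assuming the endpoint returns valid JSON):
library(jsonlite)
data <- fromJSON(jsontext)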
For a complete example, see dynamic_rvest.R.
install.package("RSelenium")
library(RSelenium)
# Method 1
rD <- rsDriver(browser="chrome", port=9515L, verbose=FALSE)
remDr <- rD[["client"]]
docker run -d -p 4445:4444 selenium/standalone-firefox
remDr <- remoteDriver(
remoteServerAddr = "localhost",
port = 4445L,
browserName = "firefox"
)
remDr$open()
remDr$navigate("https://books.toscrape.com/catalogue/category/books/science-fiction_16")
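# JavaScript-rendered content may need a moment to appear before the elements
# can be found; a short fixed wait is the simplest (if blunt) approach.
Sys.sleep(2)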
# Book titles are stored in the alt attribute of each product image.
titleElements <- remDr$findElements(using = "xpath", "//article//img")
titles <- sapply(titleElements, function(x){x$getElementAttribute("alt")[[1]]})
# Prices and stock availability are plain text nodes.
pricesElements <- remDr$findElements(using = "xpath", "//*[@class='price_color']")
prices <- sapply(pricesElements, function(x){x$getElementText()[[1]]})
stockElements <- remDr$findElements(using = "xpath", "//*[@class='instock availability']")
stocks <- sapply(stockElements, function(x){x$getElementText()[[1]]})
df <- data.frame(titles, prices, stocks)
write.csv(df, "books.csv")
If you wish to find out more about web scraping with R, see our blog post.