Script for scraping Google's COVID19 Community Mobility Reports

Primary language: R. License: MIT.

google-covid-mobility-scrape

This repo scrapes the data from Google's COVID19 Community Mobility Reports (https://www.google.com/covid19/mobility/). The code is released freely under the MIT Licence and is provided 'as-is'.

Requirements

You'll need the packages: dplyr, purrr, xml2, rvest, pdftools and countrycode. These are all on CRAN.
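One way to check that these packages are available before sourcing the functions is sketched below. The helper name missing_packages is hypothetical and not part of this repo, and the install.packages() call is left commented out so that nothing is installed unless you choose to.

```r
# Packages named in the Requirements section
required <- c("dplyr", "purrr", "xml2", "rvest", "pdftools", "countrycode")

# Hypothetical helper (not part of this repo): return the packages in `pkgs`
# that cannot be loaded from the current library paths
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

to_install <- missing_packages(required)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to install from CRAN (needs an internet connection):
  # install.packages(to_install)
}
```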

NEWS

2020-04-04 16:51 get_all_data.R script pulls data from all reports, saved in the data folder

2020-04-04 16:26 Add comments to the functions, move tidyverse library call to scripts

2020-04-03 18:22 Converted code into functions, added date and country codes into output tables, created functions for region reports (US state-level data)

2020-04-03 12:59 First version, scrape of PDF and extract of data into CSV

How to use

The R/functions.R script provides a number of functions to interact with the Google COVID19 Community Mobility Reports:

  • get_country_list() gets a list of the country reports available
  • get_national_data() extracts the overall figures from a country report
  • get_subnational_data() extracts the locality figures from a country report
  • get_region_list() gets a list of the region reports available (currently just US states)
  • get_region_data() extracts the overall figures from a region report
  • get_subregion_data() extracts the locality figures from a region report

The functions return tibbles containing the headline figures from the mobility reports. They do not extract or interact with the trend lines shown in the report charts. The tibbles have the following columns:

  • date: the date from the PDF file name
  • country: the ISO 2-character country code from the PDF file name
  • region: the region name (region reports only)
  • entity: the datapoint label, one of the six mobility entities listed below
  • value: the datapoint value; these are presented as percentages in the report but are converted to decimal representation in the tables
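Because value is stored as a decimal, turning it back into the percentage form used in the PDF reports is a single multiplication. The sketch below uses a hand-built tibble as a stand-in for the output of get_national_data(); the figures are made up for illustration, and the column types (a Date for date, a double for value) are assumptions rather than something this README confirms.

```r
library(dplyr)

# Hypothetical stand-in for the output of get_national_data().
# These values are invented for illustration, not taken from a real report.
example_data <- tibble(
  date    = as.Date("2020-03-29"),
  country = "GB",
  entity  = c("retail_recr", "grocery_pharm", "parks",
              "transit", "workplace", "residential"),
  value   = c(-0.85, -0.46, -0.52, -0.75, -0.55, 0.15)
)

# Convert the decimal values back to percentages and build a signed label
# in the style the reports use (for example a decimal of -0.85 becomes "-85%")
example_pct <- example_data %>%
  mutate(
    value_pct = value * 100,
    label     = sprintf("%+.0f%%", value_pct)
  )

print(example_pct)
```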

There are six mobility entities presented in the reports:

  • retail_recr: Retail & recreation. Mobility trends for places like restaurants, cafes, shopping centers, theme parks, museums, libraries, and movie theaters.
  • grocery_pharm: Grocery & pharmacy. Mobility trends for places like grocery markets, food warehouses, farmers markets, specialty food shops, drug stores, and pharmacies.
  • parks: Parks. Mobility trends for places like national parks, public beaches, marinas, dog parks, plazas, and public gardens.
  • transit: Transit stations. Mobility trends for places like public transport hubs such as subway, bus, and train stations.
  • workplace: Workplaces. Mobility trends for places of work.
  • residential: Residential. Mobility trends for places of residence.
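For plots or tables it can help to map the short entity values back to the category names used in the reports. A minimal sketch using a named character vector follows; entity_labels and label_entity are hypothetical names and are not provided by R/functions.R.

```r
# Lookup from the entity values above to the category names in the reports
entity_labels <- c(
  retail_recr   = "Retail & recreation",
  grocery_pharm = "Grocery & pharmacy",
  parks         = "Parks",
  transit       = "Transit stations",
  workplace     = "Workplaces",
  residential   = "Residential"
)

# Hypothetical helper: translate entity values into report category names.
# Values that are not in the lookup come back as NA.
label_entity <- function(entity) {
  unname(entity_labels[entity])
}

label_entity(c("parks", "transit"))
```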

Example code

This code is also provided in mobility_report_scraping.R.

library(tidyverse)
source("R/functions.R")

# get list of countries
# default url is https://www.google.com/covid19/mobility/
countries <- get_country_list()

# extract the url for the uk
uk_url <- countries %>% filter(country == "GB") %>% pull(url)

# extract overall data for the uk
uk_overall_data <- get_national_data(uk_url)

# extract locality data for the uk
uk_location_data <- get_subnational_data(uk_url)

# get list of us states
states <- get_region_list()

# extract the url for new york
ny_url <- states %>% filter(region == "New York") %>% pull(url)

# extract overall data for new york state
ny_data <- get_region_data(ny_url)

# extract locality data for new york state
ny_locality_data <- get_subregion_data(ny_url)
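The same pattern extends to several countries at once. The sketch below is one possible approach and is not part of the repo: collect_national is a hypothetical helper, it assumes the country list has the country and url columns used in the example above, and it assumes every call to the fetch function returns a tibble with the same columns so that the results can be row-bound. The fetch function is passed in as an argument so the helper can be tried with a stand-in before pointing it at the real reports.

```r
library(dplyr)
library(purrr)

# Hypothetical helper (not in R/functions.R): fetch data for several
# countries and stack the results into a single tibble.
#   countries: tibble with `country` and `url` columns
#   codes:     ISO 2-character country codes to keep
#   fetch:     function taking a report URL and returning a tibble
collect_national <- function(countries, codes, fetch) {
  countries %>%
    filter(country %in% codes) %>%
    pull(url) %>%
    map(fetch) %>%
    bind_rows()
}

# With the repo's functions sourced, a call might look like this:
# combined <- collect_national(get_country_list(), c("GB", "FR"), get_national_data)
```

Each URL triggers a separate PDF download, so a long list of countries will take a while to process.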