/nflfastR

A Set of Functions to Efficiently Scrape NFL Play by Play Data

Primary LanguageROtherNOASSERTION

nflfastR

nflfastR is a set of functions to efficiently scrape NFL play-by-play data. nflfastR expands upon the features of nflscrapR:

  • The package contains NFL play-by-play data back to 1999
  • As suggested by the package name, it obtains games much faster
  • Includes completion probability (cp) and completion percentage over expected (cpoe) in play-by-play going back to 2006
  • Includes drive information, including drive starting position and drive result
  • Includes series information, including series number and series success
  • Hosts a repository of play-by-play data going back to 1999 for very quick access

We owe a debt of gratitude to the original nflscrapR team, Maksim Horowitz, Ronald Yurko, and Samuel Ventura, without whose contributions and inspiration this package would not exist.

Installation

You can load and install nflfastR from GitHub with:

# If 'devtools' isn't installed run
# install.packages("devtools")

devtools::install_github("mrcaseb/nflfastR")

Usage

library(nflfastR)
library(tidyverse)

Example 1: replicate nflscrapR with fast_scraper

The functionality of nflscrapR can be duplicated by using fast_scraper This obtains the same information contained in nflscrapR (plus some extra) but much more quickly. To compare to nflscrapR, we use their data repository as the program no longer functions now that the NFL has taken down the old Gamecenter feed. Note that EP differs from nflscrapR as we use a newer era-adjusted model (more on this below).

This example also uses the built-in function clean_pbp to create a "name’ column for the primary player involved (the QB on pass play or ball-carrier on run play).

read_csv(url('https://github.com/ryurko/nflscrapR-data/blob/master/play_by_play_data/regular_season/reg_pbp_2019.csv?raw=true')) %>%
  filter(home_team == 'SF' & away_team == 'SEA') %>%
  select(desc, play_type, ep, epa, home_wp) %>% head(5) %>% 
  knitr::kable(digits = 3)
desc play_type ep epa home_wp
J.Myers kicks 65 yards from SEA 35 to end zone, Touchback. kickoff 0.815 0.000 NA
(15:00) T.Coleman left guard to SF 26 for 1 yard (J.Clowney). run 0.815 -0.606 0.500
(14:19) T.Coleman right tackle to SF 25 for -1 yards (P.Ford). run 0.209 -1.146 0.485
(13:45) (Shotgun) J.Garoppolo pass short middle to K.Bourne to SF 41 for 16 yards (J.Taylor). Caught at SF39. 2-yac pass -0.937 3.223 0.453
(12:58) PENALTY on SEA-J.Reed, Encroachment, 5 yards, enforced at SF 41 - No Play. no_play 2.286 0.774 0.551
fast_scraper('2019_10_SEA_SF') %>%
  clean_pbp() %>%
  select(desc, play_type, ep, epa, home_wp, name) %>% head(6) %>% 
  knitr::kable(digits = 3)
desc play_type ep epa home_wp name
GAME NA NA NA NA NA
5-J.Myers kicks 65 yards from SEA 35 to end zone, Touchback. kickoff 1.483 0.000 0.560 NA
(15:00) 26-T.Coleman left guard to SF 26 for 1 yard (90-J.Clowney). run 1.483 -0.581 0.560 T.Coleman
(14:19) 26-T.Coleman right tackle to SF 25 for -1 yards (97-P.Ford). run 0.903 -0.810 0.543 T.Coleman
(13:45) (Shotgun) 10-J.Garoppolo pass short middle to 84-K.Bourne to SF 41 for 16 yards (24-J.Taylor). Caught at SF39. 2-yac pass 0.092 2.349 0.506 J.Garoppolo
(12:58) PENALTY on SEA-91-J.Reed, Encroachment, 5 yards, enforced at SF 41 - No Play. no_play 2.441 0.820 0.591 NA

Example 2: scrape a batch of games very quickly with fast_scraper and parallel processing

This is a demonstration of nflfastR’s capabilities. While nflfastR can scrape a batch of games very quickly, please be respectful of Github’s servers and use the data repository which hosts all the scraped and cleaned data whenever possible. The only reason to ever actually use the scraper is if it’s in the middle of the season and we haven’t updated the repository with recent games (but we will try to keep it updated).

#get list of some games from 2019
games_2019 <- fast_scraper_schedules(2019) %>% head(10) %>% pull(game_id)

tictoc::tic(glue::glue('{length(games_2019)} games with nflfastR:'))
f <- fast_scraper(games_2019, pp = TRUE)
tictoc::toc()
#> 10 games with nflfastR:: 14.46 sec elapsed

Example 3: completion percentage over expected (CPOE)

Let’s look at CPOE leaders from the 2009 regular season.

As discussed above, nflfastR has a data repository for old seasons, so there’s no need to actually scrape them. Let’s use that here (the below reads .rds files, but .csv is also available).

tictoc::tic('loading all games from 2009')
games_2009 <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2009.rds')) %>% filter(season_type == 'REG')
tictoc::toc()
#> loading all games from 2009: 2.8 sec elapsed
games_2009 %>% filter(!is.na(cpoe)) %>% group_by(passer_player_name) %>%
  summarize(cpoe = mean(cpoe), Atts=n()) %>%
  filter(Atts > 200) %>%
  arrange(-cpoe) %>%
  head(5) %>% 
  knitr::kable(digits = 1)
passer_player_name cpoe Atts
D.Brees 7.4 509
P.Manning 6.7 569
P.Rivers 6.6 474
B.Favre 6.0 527
M.Schaub 5.3 571

Example 4: using drive information

When scraping from the default RS feed, drive results are automatically included. Let’s look at how much more likely teams were to score starting from 1st & 10 at their own 20 yard line in 2015 (the last year before touchbacks on kickoffs changed to the 25) than in 2000.

games_2000 <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2000.rds'))
games_2015 <-readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2015.rds'))

pbp <- bind_rows(games_2000, games_2015)

pbp %>% filter(season_type == 'REG' & down == 1 & ydstogo == 10 & yardline_100 == 80) %>%
  mutate(drive_score = if_else(drive_end_transition %in% c("Touchdown", "Field Goal", "TOUCHDOWN", "FIELD_GOAL"), 1, 0)) %>%
  group_by(season) %>%
  summarize(drive_score = mean(drive_score)) %>% 
  knitr::kable(digits = 3)
season drive_score
2000 0.234
2015 0.305

So about 23% of 1st & 10 plays from teams’ own 20 would see the drive end up in a score in 2000, compared to 30% in 2015. This has implications for EPA models (see below).

Example 5: Plot offensive and defensive EPA per play for a given season

Let’s build the NFL team tiers using offensive and defensive expected points added per play for the 2005 regular season. The logo urls of the espn logos are integrated into the ‘team_colors_logos’ data frame which is delivered with the package.

Let’s also use the included helper function clean_pbp, which creates “rush” and “pass” columns that (a) properly count sacks and scrambles as pass plays and (b) properly include plays with penalties. Using this, we can keep only rush or pass plays.

library(ggimage)
pbp <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2005.rds')) %>%
  filter(season_type == 'REG') %>% filter(!is.na(posteam) & (rush == 1 | pass == 1))
offense <- pbp %>% group_by(posteam) %>% summarise(off_epa = mean(epa, na.rm = TRUE))
defense <- pbp %>% group_by(defteam) %>% summarise(def_epa = mean(epa, na.rm = TRUE))
logos <- teams_colors_logos %>% select(team_abbr, team_logo_espn)

offense %>%
  inner_join(defense, by = c("posteam" = "defteam")) %>%
  inner_join(logos, by = c("posteam" = "team_abbr")) %>%
  ggplot(aes(x = off_epa, y = def_epa)) +
  geom_abline(slope = -1.5, intercept = c(.4, .3, .2, .1, 0, -.1, -.2, -.3), alpha = .2) +
  geom_hline(aes(yintercept = mean(off_epa)), color = "red", linetype = "dashed") +
  geom_vline(aes(xintercept = mean(def_epa)), color = "red", linetype = "dashed") +
  geom_image(aes(image = team_logo_espn), size = 0.05, asp = 16 / 9) +
  labs(
    x = "Offense EPA/play",
    y = "Defense EPA/play",
    caption = "Data: @nflfastR",
    title = "2005 NFL Offensive and Defensive EPA per Play"
  ) +
  theme_bw() +
  theme(
    aspect.ratio = 9 / 16,
    plot.title = element_text(size = 12, hjust = 0.5, face = "bold")
  ) +
  scale_y_reverse()

Example 6: Working with roster and position data

This section used to contain an example of working with roster data. Unfortunately, we have not found a way to obtain roster data that can be joined to the new play by play, so for now, it is empty. We would like to be able to get position data but haven’t yet.

The clean_pbp() function does a lot of work cleaning up player names and IDs for the purpose of joining them to roster data, but we do not have any roster data to join to. Note that player IDs are inconsistent between the old (1999-2010) and new (2011 - present) data sources so use IDs with caution. Unfortunately there is nothing we can do about this as the NFL changed their system for IDs in the underlying data.

nflfastR models

nflfastR uses its own models for Expected Points, Win Probability, and Completion Percentage. To read about the models, please see here. For a more detailed description of Expected Points models, we highly recommend this paper from the nflscrapR team located here.

nflfastR includes two win probability models: one with and one without incorporating the pre-game spread.

fast_scraper('2019_03_NYJ_NE') %>%
  select(desc, spread_line, home_wp, vegas_home_wp) %>% 
  head(5) %>%
  knitr::kable(digits = 3)
desc spread_line home_wp vegas_home_wp
GAME 20.5 NA NA
3-S.Gostkowski kicks 65 yards from NE 35 to end zone, Touchback. 20.5 0.563 0.858
(15:00) 26-L.Bell up the middle to NYJ 32 for 7 yards (71-D.Shelton; 93-L.Guy). 20.5 0.563 0.858
(14:29) 8-L.Falk pass short right to 82-J.Crowder pushed ob at NYJ 47 for 15 yards (31-J.Jones). 20.5 0.552 0.876
(13:59) 26-L.Bell up the middle to NYJ 48 for 1 yard (54-D.Hightower, 55-J.Simon). 20.5 0.516 0.780

Because the Patriots were favored by 20.5 points in this Week 3 Jets-Patriots game, the naive (no-spread) and spread win probabilities differ dramatically. In the no-spread model, home_wp begins at 0.56 (greater than 0.50 because home team advantage is included in the model) and vegas_home_wp is 0.858. We have also included vegas_wp which is the win probability of the possession team, incorporating the spread.

Data repository

Even though nflfastR is very fast, for historical games we recommend downloading the data from here as in Examples 4 and 5 above. These data sets include play-by-play data of complete seasons going back to 1999 and we will update them in 2020 once the season starts. The files contain both regular season and postseason data, and one can use game_type or week to figure out which games occurred in the postseason. Data are available as .csv.gz, .parquet, or .rds.

About

nflfastR was developed by Sebastian Carl and Ben Baldwin.

Special thanks