- Installation
- Usage
- Example 1: replicate
nflscrapR
withfast_scraper
- Example 2: scrape a batch of games very quickly with
fast_scraper
and parallel processing - Example 3: completion percentage over expected (CPOE)
- Example 4: using drive information
- Example 5: Plot offensive and defensive EPA per play for a given season
- Example 6: Working with roster and position data
- Example 1: replicate
nflfastR
models- Data repository
- About
- Special thanks
nflfastR
is a set of functions to efficiently scrape NFL play-by-play
data. nflfastR
expands upon the features of nflscrapR:
- The package contains NFL play-by-play data back to 1999
- As suggested by the package name, it obtains games much faster
- Includes completion probability (
cp
) and completion percentage over expected (cpoe
) in play-by-play going back to 2006 - Includes drive information, including drive starting position and drive result
- Includes series information, including series number and series success
- Hosts a repository of play-by-play data going back to 1999 for very quick access
We owe a debt of gratitude to the original
nflscrapR
team, Maksim
Horowitz, Ronald Yurko, and Samuel Ventura, without whose contributions
and inspiration this package would not exist.
You can load and install nflfastR from GitHub with:
# If 'devtools' isn't installed run
# install.packages("devtools")
devtools::install_github("mrcaseb/nflfastR")
library(nflfastR)
library(tidyverse)
The functionality of nflscrapR
can be duplicated by using
fast_scraper
This obtains the same information contained in
nflscrapR
(plus some extra) but much more quickly. To compare to
nflscrapR
, we use their data repository as the program no longer
functions now that the NFL has taken down the old Gamecenter feed. Note
that EP differs from nflscrapR as we use a newer era-adjusted model
(more on this below).
This example also uses the built-in function clean_pbp
to create a
"name’ column for the primary player involved (the QB on pass play or
ball-carrier on run play).
read_csv(url('https://github.com/ryurko/nflscrapR-data/blob/master/play_by_play_data/regular_season/reg_pbp_2019.csv?raw=true')) %>%
filter(home_team == 'SF' & away_team == 'SEA') %>%
select(desc, play_type, ep, epa, home_wp) %>% head(5) %>%
knitr::kable(digits = 3)
desc | play_type | ep | epa | home_wp |
---|---|---|---|---|
J.Myers kicks 65 yards from SEA 35 to end zone, Touchback. | kickoff | 0.815 | 0.000 | NA |
(15:00) T.Coleman left guard to SF 26 for 1 yard (J.Clowney). | run | 0.815 | -0.606 | 0.500 |
(14:19) T.Coleman right tackle to SF 25 for -1 yards (P.Ford). | run | 0.209 | -1.146 | 0.485 |
(13:45) (Shotgun) J.Garoppolo pass short middle to K.Bourne to SF 41 for 16 yards (J.Taylor). Caught at SF39. 2-yac | pass | -0.937 | 3.223 | 0.453 |
(12:58) PENALTY on SEA-J.Reed, Encroachment, 5 yards, enforced at SF 41 - No Play. | no_play | 2.286 | 0.774 | 0.551 |
fast_scraper('2019_10_SEA_SF') %>%
clean_pbp() %>%
select(desc, play_type, ep, epa, home_wp, name) %>% head(6) %>%
knitr::kable(digits = 3)
desc | play_type | ep | epa | home_wp | name |
---|---|---|---|---|---|
GAME | NA | NA | NA | NA | NA |
5-J.Myers kicks 65 yards from SEA 35 to end zone, Touchback. | kickoff | 1.483 | 0.000 | 0.560 | NA |
(15:00) 26-T.Coleman left guard to SF 26 for 1 yard (90-J.Clowney). | run | 1.483 | -0.581 | 0.560 | T.Coleman |
(14:19) 26-T.Coleman right tackle to SF 25 for -1 yards (97-P.Ford). | run | 0.903 | -0.810 | 0.543 | T.Coleman |
(13:45) (Shotgun) 10-J.Garoppolo pass short middle to 84-K.Bourne to SF 41 for 16 yards (24-J.Taylor). Caught at SF39. 2-yac | pass | 0.092 | 2.349 | 0.506 | J.Garoppolo |
(12:58) PENALTY on SEA-91-J.Reed, Encroachment, 5 yards, enforced at SF 41 - No Play. | no_play | 2.441 | 0.820 | 0.591 | NA |
This is a demonstration of nflfastR
’s capabilities. While nflfastR
can scrape a batch of games very quickly, please be respectful of
Github’s servers and use the data
repository which hosts all
the scraped and cleaned data whenever possible. The only reason to
ever actually use the scraper is if it’s in the middle of the season and
we haven’t updated the repository with recent games (but we will try to
keep it updated).
#get list of some games from 2019
games_2019 <- fast_scraper_schedules(2019) %>% head(10) %>% pull(game_id)
tictoc::tic(glue::glue('{length(games_2019)} games with nflfastR:'))
f <- fast_scraper(games_2019, pp = TRUE)
tictoc::toc()
#> 10 games with nflfastR:: 14.46 sec elapsed
Let’s look at CPOE leaders from the 2009 regular season.
As discussed above, nflfastR
has a data repository for old seasons, so
there’s no need to actually scrape them. Let’s use that here (the below
reads .rds files, but .csv is also available).
tictoc::tic('loading all games from 2009')
games_2009 <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2009.rds')) %>% filter(season_type == 'REG')
tictoc::toc()
#> loading all games from 2009: 2.8 sec elapsed
games_2009 %>% filter(!is.na(cpoe)) %>% group_by(passer_player_name) %>%
summarize(cpoe = mean(cpoe), Atts=n()) %>%
filter(Atts > 200) %>%
arrange(-cpoe) %>%
head(5) %>%
knitr::kable(digits = 1)
passer_player_name | cpoe | Atts |
---|---|---|
D.Brees | 7.4 | 509 |
P.Manning | 6.7 | 569 |
P.Rivers | 6.6 | 474 |
B.Favre | 6.0 | 527 |
M.Schaub | 5.3 | 571 |
When scraping from the default RS feed, drive results are automatically included. Let’s look at how much more likely teams were to score starting from 1st & 10 at their own 20 yard line in 2015 (the last year before touchbacks on kickoffs changed to the 25) than in 2000.
games_2000 <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2000.rds'))
games_2015 <-readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2015.rds'))
pbp <- bind_rows(games_2000, games_2015)
pbp %>% filter(season_type == 'REG' & down == 1 & ydstogo == 10 & yardline_100 == 80) %>%
mutate(drive_score = if_else(drive_end_transition %in% c("Touchdown", "Field Goal", "TOUCHDOWN", "FIELD_GOAL"), 1, 0)) %>%
group_by(season) %>%
summarize(drive_score = mean(drive_score)) %>%
knitr::kable(digits = 3)
season | drive_score |
---|---|
2000 | 0.234 |
2015 | 0.305 |
So about 23% of 1st & 10 plays from teams’ own 20 would see the drive end up in a score in 2000, compared to 30% in 2015. This has implications for EPA models (see below).
Let’s build the NFL team tiers using offensive and defensive expected points added per play for the 2005 regular season. The logo urls of the espn logos are integrated into the ‘team_colors_logos’ data frame which is delivered with the package.
Let’s also use the included helper function clean_pbp
, which creates
“rush” and “pass” columns that (a) properly count sacks and scrambles
as pass plays and (b) properly include plays with penalties. Using this,
we can keep only rush or pass plays.
library(ggimage)
pbp <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2005.rds')) %>%
filter(season_type == 'REG') %>% filter(!is.na(posteam) & (rush == 1 | pass == 1))
offense <- pbp %>% group_by(posteam) %>% summarise(off_epa = mean(epa, na.rm = TRUE))
defense <- pbp %>% group_by(defteam) %>% summarise(def_epa = mean(epa, na.rm = TRUE))
logos <- teams_colors_logos %>% select(team_abbr, team_logo_espn)
offense %>%
inner_join(defense, by = c("posteam" = "defteam")) %>%
inner_join(logos, by = c("posteam" = "team_abbr")) %>%
ggplot(aes(x = off_epa, y = def_epa)) +
geom_abline(slope = -1.5, intercept = c(.4, .3, .2, .1, 0, -.1, -.2, -.3), alpha = .2) +
geom_hline(aes(yintercept = mean(off_epa)), color = "red", linetype = "dashed") +
geom_vline(aes(xintercept = mean(def_epa)), color = "red", linetype = "dashed") +
geom_image(aes(image = team_logo_espn), size = 0.05, asp = 16 / 9) +
labs(
x = "Offense EPA/play",
y = "Defense EPA/play",
caption = "Data: @nflfastR",
title = "2005 NFL Offensive and Defensive EPA per Play"
) +
theme_bw() +
theme(
aspect.ratio = 9 / 16,
plot.title = element_text(size = 12, hjust = 0.5, face = "bold")
) +
scale_y_reverse()
This section used to contain an example of working with roster data. Unfortunately, we have not found a way to obtain roster data that can be joined to the new play by play, so for now, it is empty. We would like to be able to get position data but haven’t yet.
The clean_pbp()
function does a lot of work cleaning up player names
and IDs for the purpose of joining them to roster data, but we do not
have any roster data to join to. Note that player IDs are inconsistent
between the old (1999-2010) and new (2011 - present) data sources so use
IDs with caution. Unfortunately there is nothing we can do about this
as the NFL changed their system for IDs in the underlying data.
nflfastR
uses its own models for Expected Points, Win Probability, and
Completion Percentage. To read about the models, please see
here.
For a more detailed description of Expected Points models, we highly
recommend this paper from the nflscrapR team located
here.
nflfastR
includes two win probability models: one with and one without
incorporating the pre-game spread.
fast_scraper('2019_03_NYJ_NE') %>%
select(desc, spread_line, home_wp, vegas_home_wp) %>%
head(5) %>%
knitr::kable(digits = 3)
desc | spread_line | home_wp | vegas_home_wp |
---|---|---|---|
GAME | 20.5 | NA | NA |
3-S.Gostkowski kicks 65 yards from NE 35 to end zone, Touchback. | 20.5 | 0.563 | 0.858 |
(15:00) 26-L.Bell up the middle to NYJ 32 for 7 yards (71-D.Shelton; 93-L.Guy). | 20.5 | 0.563 | 0.858 |
(14:29) 8-L.Falk pass short right to 82-J.Crowder pushed ob at NYJ 47 for 15 yards (31-J.Jones). | 20.5 | 0.552 | 0.876 |
(13:59) 26-L.Bell up the middle to NYJ 48 for 1 yard (54-D.Hightower, 55-J.Simon). | 20.5 | 0.516 | 0.780 |
Because the Patriots were favored by 20.5 points in this Week 3
Jets-Patriots game, the naive (no-spread) and spread win probabilities
differ dramatically. In the no-spread model, home_wp
begins at 0.56
(greater than 0.50 because home team advantage is included in the model)
and vegas_home_wp
is 0.858. We have also included vegas_wp
which is
the win probability of the possession team, incorporating the spread.
Even though nflfastR
is very fast, for historical games we recommend
downloading the data from
here as in Examples 4 and 5
above. These data sets include play-by-play data of complete seasons
going back to 1999 and we will update them in 2020 once the season
starts. The files contain both regular season and postseason data, and
one can use game_type or week to figure out which games occurred in the
postseason. Data are available as .csv.gz, .parquet, or .rds.
nflfastR
was developed by Sebastian
Carl and Ben
Baldwin.
- To Nick Shoemaker for finding
and making available JSON-formatted NFL play-by-play back
to 1999 (
nflfastR
uses this source for 1999-2010) - To Lee Sharpe for curating a resource for game information
- To Timo Riske, Lau Sze
Yui, Sean
Clement, and Daniel
Houston for many helpful
discussions regarding the development of the new
nflfastR
models - To Zach Feldman and Josh Hermsmeyer for many helpful discussions about CPOE models
- To Florian Schmitt for the logo design
- To Peter Owen for many helpful suggestions for the CP model
- The many users who found and reported bugs in
nflfastR
1.0 - And of course, the original
nflscrapR
team, Maksim Horowitz, Ronald Yurko, and Samuel Ventura, whose work represented a dramatic step forward for the state of public NFL research