[BUG] nflfastR::calculate_player_stats returns duplicate rows for defense and kicker
isaactpetersen opened this issue · 4 comments
Is there an existing issue for this?
- I have searched the existing issues
If this is a data issue, have you tried clearing your nflverse cache?
I have cleared my nflverse cache and the issue persists.
What version of the package do you have?
nflreadr
1.4.1
Describe the bug
There are duplicated player_id-season-week combinations in the player stats database (from the load_player_stats() function). I cannot think of a reason why the same player would have multiple rows for a given season and week combination. If, as I suspect, this is not possible, then this is a data issue to fix. If I'm incorrect and the same player can plausibly have multiple rows for a given season and week combination, then it would be helpful to know the circumstances in which this can arise. This matters when merging with other datasets, to ensure I am merging the information to the correct player_id-season-week combination.
Reprex
library("nflreadr")
library("dplyr")
#> Warning: package 'dplyr' was built under R version 4.3.2
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# Load Data
offenseStats_weekly <- load_player_stats(
  seasons = TRUE,
  stat_type = "offense")
defenseStats_weekly <- load_player_stats(
  seasons = TRUE,
  stat_type = "defense")
kickingStats_weekly <- load_player_stats(
  seasons = TRUE,
  stat_type = "kicking")
# Rearrange variables
offenseStats_weekly <- offenseStats_weekly %>%
  select(player_id, season, week, everything())
defenseStats_weekly <- defenseStats_weekly %>%
  select(player_id, season, week, everything())
kickingStats_weekly <- kickingStats_weekly %>%
  select(player_id, season, week, everything())
# Offense: No duplicate id-season-week combinations
offenseStats_weekly %>%
  group_by(player_id, season, week) %>%
  filter(n() > 1)
#> # A tibble: 0 × 53
#> # Groups: player_id, season, week [0]
#> # ℹ 53 variables: player_id <chr>, season <int>, week <int>, player_name <chr>,
#> # player_display_name <chr>, position <chr>, position_group <chr>,
#> # headshot_url <chr>, recent_team <chr>, season_type <chr>,
#> # opponent_team <chr>, completions <int>, attempts <int>,
#> # passing_yards <dbl>, passing_tds <int>, interceptions <dbl>, sacks <dbl>,
#> # sack_yards <dbl>, sack_fumbles <int>, sack_fumbles_lost <int>,
#> # passing_air_yards <dbl>, passing_yards_after_catch <dbl>, …
# Defense
defenseStats_weekly %>%
  group_by(player_id, season, week) %>%
  filter(n() > 1)
#> # A tibble: 496 × 32
#> # Groups: player_id, season, week [183]
#> player_id season week season_type player_name player_display_name position
#> <chr> <int> <int> <chr> <chr> <chr> <chr>
#> 1 0 1999 1 REG <NA> <NA> <NA>
#> 2 0 1999 1 REG <NA> <NA> <NA>
#> 3 0 1999 1 REG <NA> <NA> <NA>
#> 4 0 1999 1 REG <NA> <NA> <NA>
#> 5 0 1999 1 REG <NA> <NA> <NA>
#> 6 0 1999 1 REG <NA> <NA> <NA>
#> 7 0 1999 1 REG <NA> <NA> <NA>
#> 8 0 1999 1 REG <NA> <NA> <NA>
#> 9 0 1999 2 REG <NA> <NA> <NA>
#> 10 0 1999 2 REG <NA> <NA> <NA>
#> # ℹ 486 more rows
#> # ℹ 25 more variables: position_group <chr>, headshot_url <chr>, team <chr>,
#> # def_tackles <int>, def_tackles_solo <int>, def_tackles_with_assist <int>,
#> # def_tackle_assists <int>, def_tackles_for_loss <int>,
#> # def_tackles_for_loss_yards <dbl>, def_fumbles_forced <int>,
#> # def_sacks <dbl>, def_sack_yards <dbl>, def_qb_hits <dbl>,
#> # def_interceptions <dbl>, def_interception_yards <dbl>, …
defenseStats_weekly %>%
  group_by(player_id, season, week) %>%
  filter(n() > 1, player_id != "0") # not sure why there are player_ids of "0"; exclude them
#> # A tibble: 296 × 32
#> # Groups: player_id, season, week [148]
#> player_id season week season_type player_name player_display_name position
#> <chr> <int> <int> <chr> <chr> <chr> <chr>
#> 1 00-0002919 1999 4 REG <NA> Corey Chavous SS
#> 2 00-0002919 1999 4 REG <NA> Corey Chavous SS
#> 3 00-0004543 1999 12 REG <NA> Shane Dronett DT
#> 4 00-0004543 1999 12 REG <NA> Shane Dronett DT
#> 5 00-0004915 1999 16 REG <NA> Bobby Engram WR
#> 6 00-0004915 1999 16 REG <NA> Bobby Engram WR
#> 7 00-0010668 1999 20 POST <NA> Keenan McCardell WR
#> 8 00-0010668 1999 20 POST <NA> Keenan McCardell WR
#> 9 00-0011392 1999 14 REG <NA> Basil Mitchell RB
#> 10 00-0011392 1999 14 REG <NA> Basil Mitchell RB
#> # ℹ 286 more rows
#> # ℹ 25 more variables: position_group <chr>, headshot_url <chr>, team <chr>,
#> # def_tackles <int>, def_tackles_solo <int>, def_tackles_with_assist <int>,
#> # def_tackle_assists <int>, def_tackles_for_loss <int>,
#> # def_tackles_for_loss_yards <dbl>, def_fumbles_forced <int>,
#> # def_sacks <dbl>, def_sack_yards <dbl>, def_qb_hits <dbl>,
#> # def_interceptions <dbl>, def_interception_yards <dbl>, …
# Kicking
kickingStats_weekly %>%
  group_by(player_id, season, week) %>%
  filter(n() > 1)
#> # A tibble: 4 × 44
#> # Groups: player_id, season, week [2]
#> player_id season week season_type team player_name player_display_name
#> <chr> <int> <int> <chr> <chr> <chr> <chr>
#> 1 00-0004811 2000 11 REG DEN <NA> Jason Elam
#> 2 00-0004811 2000 11 REG LV <NA> Jason Elam
#> 3 00-0012875 2002 4 REG PIT <NA> Todd Peterson
#> 4 00-0012875 2002 4 REG PIT <NA> Todd Peterson
#> # ℹ 37 more variables: position <chr>, position_group <chr>,
#> # headshot_url <chr>, fg_made <int>, fg_att <dbl>, fg_missed <int>,
#> # fg_blocked <int>, fg_long <dbl>, fg_pct <dbl>, fg_made_0_19 <int>,
#> # fg_made_20_29 <int>, fg_made_30_39 <int>, fg_made_40_49 <int>,
#> # fg_made_50_59 <int>, fg_made_60_ <int>, fg_missed_0_19 <int>,
#> # fg_missed_20_29 <int>, fg_missed_30_39 <int>, fg_missed_40_49 <int>,
#> # fg_missed_50_59 <int>, fg_missed_60_ <int>, fg_made_list <chr>, …
sessionInfo()
#> R version 4.3.1 (2023-06-16 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 11 x64 (build 22631)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=English_United States.utf8
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: America/Chicago
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.1.4 nflreadr_1.4.1
#>
#> loaded via a namespace (and not attached):
#> [1] vctrs_0.6.5 cli_3.6.3 knitr_1.48 rlang_1.1.4
#> [5] xfun_0.46 generics_0.1.3 data.table_1.15.4 glue_1.7.0
#> [9] htmltools_0.5.8.1 fansi_1.0.6 rmarkdown_2.27 evaluate_0.24.0
#> [13] tibble_3.2.1 fastmap_1.2.0 yaml_2.3.10 lifecycle_1.0.4
#> [17] memoise_2.0.1 compiler_4.3.1 fs_1.6.4 pkgconfig_2.0.3
#> [21] rstudioapi_0.16.0 digest_0.6.36 R6_2.5.1 tidyselect_1.2.1
#> [25] reprex_2.1.1 utf8_1.2.4 pillar_1.9.0 magrittr_2.0.3
#> [29] tools_4.3.1 withr_3.0.0 cachem_1.1.0
Created on 2024-07-31 with reprex v2.1.1
Expected Behavior
I expect each player (i.e., player_id) to have only one row for a given season-week combination.
nflverse_sitrep
> nflreadr::nflverse_sitrep()
── System Info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• R version 4.3.1 (2023-06-16 ucrt) • Running under: Windows 11 x64 (build 22631)
── Package Status ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
package installed cran dev behind
1 nfl4th 1.0.4 1.0.4 1.0.4.9002 dev
2 nflfastR 4.6.1 4.6.1 4.6.1.9010 dev
3 nflplotR 1.3.1 1.3.1 1.3.1
4 nflreadr 1.4.1 1.4.1 1.4.1.00
── Package Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• No options set for above packages
── Package Dependencies ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• askpass (1.2.0) • httr (1.4.7) • stringi (1.8.4)
• backports (1.5.0) • isoband (0.2.7) • stringr (1.5.1)
• base64enc (0.1-3) • janitor (2.2.0) • sys (3.4.2)
• bigD (0.2.0) • jquerylib (0.1.4) • tibble (3.2.1)
• bitops (1.0-8) • jsonlite (1.8.8) • tidyr (1.3.1)
• bslib (0.8.0) • juicyjuice (0.1.0) • tidyselect (1.2.1)
• cachem (1.1.0) • knitr (1.48) • timechange (0.3.0)
• cli (3.6.3) • labeling (0.4.3) • tinytex (0.52)
• colorspace (2.1-1) • lifecycle (1.0.4) • utf8 (1.2.4)
• commonmark (1.9.1) • listenv (0.9.1) • V8 (4.4.2)
• cpp11 (0.4.7) • lubridate (1.9.3) • vctrs (0.6.5)
• curl (5.2.1) • magick (2.8.4) • viridisLite (0.4.2)
• data.table (1.15.4) • magrittr (2.0.3) • withr (3.0.0)
• digest (0.6.36) • markdown (1.13) • xfun (0.46)
• dplyr (1.1.4) • Matrix (1.6-5) • xgboost (1.7.8.1)
• evaluate (0.24.0) • memoise (2.0.1) • xml2 (1.3.6)
• fansi (1.0.6) • mime (0.12) • yaml (2.3.10)
• farver (2.1.2) • munsell (0.5.1) • codetools (0.2-20)
• fastmap (1.2.0) • openssl (2.2.0) • compiler (4.3.1)
• fastrmodels (1.0.2) • parallelly (1.38.0) • graphics (4.3.1)
• fontawesome (0.5.2) • pillar (1.9.0) • grDevices (4.3.1)
• fs (1.6.4) • pkgconfig (2.0.3) • grid (4.3.1)
• furrr (0.3.1) • progressr (0.14.0) • lattice (0.22-6)
• future (1.34.0) • purrr (1.0.2) • MASS (7.3-60.0.1)
• generics (0.1.3) • R6 (2.5.1) • Matrix (1.6-5)
• ggpath (1.0.1) • rappdirs (0.3.3) • methods (4.3.1)
• ggplot2 (3.5.1) • RColorBrewer (1.1-3) • mgcv (1.9-1)
• globals (0.16.3) • Rcpp (1.0.13) • nlme (3.1-165)
• glue (1.7.0) • reactable (0.4.4) • parallel (4.3.1)
• gt (0.11.0) • reactR (0.6.0) • splines (4.3.1)
• gtable (0.3.5) • rlang (1.1.4) • stats (4.3.1)
• highr (0.11) • rmarkdown (2.27) • tools (4.3.1)
• hms (1.1.3) • sass (0.4.9) • utils (4.3.1)
• htmltools (0.5.8.1) • scales (1.3.0)
• htmlwidgets (1.6.4) • snakecase (0.11.1)
── Not Installed ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• nflseedR ()
• nflverse ()
Relocating to nflfastR repo
Looking at the problematic defense data, it seems that players are attributed to the opponent team in some cases when they record a fumble recovery or a penalty.
CORRECTION: I think we assign tackles after turnovers to the wrong team.
So the main cause might be that an offensive player records a defensive stat (e.g., a tackle) after the offense has turned the ball over.
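One way to check this hypothesis is to split the duplicated keys into those whose rows differ by team and those repeated within a single team. The sketch below is only an illustration on toy data: the stat values and the player_id "00-0000001" are made up, and only the two duplicated keys and their team values are taken from the kicking output in the reprex. It assumes the real data has a team column, as the defense and kicking outputs show.

```r
library(dplyr)

# Toy data: the `stat` values and the id "00-0000001" are hypothetical;
# the duplicated keys and teams mirror the kicking rows in the reprex.
toy_stats <- tibble(
  player_id = c("00-0004811", "00-0004811", "00-0012875", "00-0012875", "00-0000001"),
  season    = c(2000L, 2000L, 2002L, 2002L, 2002L),
  week      = c(11L, 11L, 4L, 4L, 4L),
  team      = c("DEN", "LV", "PIT", "PIT", "PIT"),
  stat      = c(3L, 0L, 2L, 2L, 1L)
)

# For each duplicated player_id-season-week key, count rows and distinct teams
dup_summary <- toy_stats %>%
  group_by(player_id, season, week) %>%
  summarise(
    n_rows  = n(),
    n_teams = n_distinct(team),
    .groups = "drop"
  ) %>%
  filter(n_rows > 1) %>%
  mutate(
    likely_cause = if_else(
      n_teams > 1,
      "split across teams",
      "repeated within one team"
    )
  )

print(dup_summary)
```

If most duplicated keys in the real defense data fall into the "split across teams" group, that would be consistent with stats being credited to the wrong team after a turnover. Keys repeated within one team would need a different explanation.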