nflverse/nflfastR

[BUG] nflfastR::calculate_player_stats returns duplicate rows for defense and kicking stats

isaactpetersen opened this issue · 4 comments

Is there an existing issue for this?

  • I have searched the existing issues

If this is a data issue, have you tried clearing your nflverse cache?

I have cleared my nflverse cache and the issue persists.

What version of the package do you have?

nflreadr 1.4.1

Describe the bug

The player stats data returned by load_player_stats() contain duplicated player_id-season-week combinations. I cannot think of a reason why the same player would have multiple rows for a given season and week. If, as I suspect, this should not be possible, then it is a data issue to fix. If I am wrong and the same player can plausibly have multiple rows for a given season-week combination, it would be helpful to know the circumstances under which this arises. This matters when merging with other datasets, because I need to be sure that I am merging information onto the correct player_id-season-week combination.

Reprex

library("nflreadr")
library("dplyr")
#> Warning: package 'dplyr' was built under R version 4.3.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Load Data
offenseStats_weekly <- load_player_stats(
    seasons = TRUE,
    stat_type = "offense")

defenseStats_weekly <- load_player_stats(
    seasons = TRUE,
    stat_type = "defense")

kickingStats_weekly <- load_player_stats(
    seasons = TRUE,
    stat_type = "kicking")

# Rearrange variables
offenseStats_weekly <- offenseStats_weekly %>% 
  select(player_id, season, week, everything())

defenseStats_weekly <- defenseStats_weekly %>% 
  select(player_id, season, week, everything())

kickingStats_weekly <- kickingStats_weekly %>% 
  select(player_id, season, week, everything())

# Offense: No duplicate id-season-week combinations
offenseStats_weekly %>% 
  group_by(player_id, season, week) %>% 
  filter(n() > 1)
#> # A tibble: 0 × 53
#> # Groups:   player_id, season, week [0]
#> # ℹ 53 variables: player_id <chr>, season <int>, week <int>, player_name <chr>,
#> #   player_display_name <chr>, position <chr>, position_group <chr>,
#> #   headshot_url <chr>, recent_team <chr>, season_type <chr>,
#> #   opponent_team <chr>, completions <int>, attempts <int>,
#> #   passing_yards <dbl>, passing_tds <int>, interceptions <dbl>, sacks <dbl>,
#> #   sack_yards <dbl>, sack_fumbles <int>, sack_fumbles_lost <int>,
#> #   passing_air_yards <dbl>, passing_yards_after_catch <dbl>, …

# Defense
defenseStats_weekly %>% 
  group_by(player_id, season, week) %>% 
  filter(n() > 1)
#> # A tibble: 496 × 32
#> # Groups:   player_id, season, week [183]
#>    player_id season  week season_type player_name player_display_name position
#>    <chr>      <int> <int> <chr>       <chr>       <chr>               <chr>   
#>  1 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  2 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  3 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  4 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  5 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  6 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  7 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  8 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  9 0           1999     2 REG         <NA>        <NA>                <NA>    
#> 10 0           1999     2 REG         <NA>        <NA>                <NA>    
#> # ℹ 486 more rows
#> # ℹ 25 more variables: position_group <chr>, headshot_url <chr>, team <chr>,
#> #   def_tackles <int>, def_tackles_solo <int>, def_tackles_with_assist <int>,
#> #   def_tackle_assists <int>, def_tackles_for_loss <int>,
#> #   def_tackles_for_loss_yards <dbl>, def_fumbles_forced <int>,
#> #   def_sacks <dbl>, def_sack_yards <dbl>, def_qb_hits <dbl>,
#> #   def_interceptions <dbl>, def_interception_yards <dbl>, …

defenseStats_weekly %>% 
  group_by(player_id, season, week) %>% 
  filter(n() > 1, player_id != "0") # not sure why some player_id values are "0"; exclude them
#> # A tibble: 296 × 32
#> # Groups:   player_id, season, week [148]
#>    player_id  season  week season_type player_name player_display_name position
#>    <chr>       <int> <int> <chr>       <chr>       <chr>               <chr>   
#>  1 00-0002919   1999     4 REG         <NA>        Corey Chavous       SS      
#>  2 00-0002919   1999     4 REG         <NA>        Corey Chavous       SS      
#>  3 00-0004543   1999    12 REG         <NA>        Shane Dronett       DT      
#>  4 00-0004543   1999    12 REG         <NA>        Shane Dronett       DT      
#>  5 00-0004915   1999    16 REG         <NA>        Bobby Engram        WR      
#>  6 00-0004915   1999    16 REG         <NA>        Bobby Engram        WR      
#>  7 00-0010668   1999    20 POST        <NA>        Keenan McCardell    WR      
#>  8 00-0010668   1999    20 POST        <NA>        Keenan McCardell    WR      
#>  9 00-0011392   1999    14 REG         <NA>        Basil Mitchell      RB      
#> 10 00-0011392   1999    14 REG         <NA>        Basil Mitchell      RB      
#> # ℹ 286 more rows
#> # ℹ 25 more variables: position_group <chr>, headshot_url <chr>, team <chr>,
#> #   def_tackles <int>, def_tackles_solo <int>, def_tackles_with_assist <int>,
#> #   def_tackle_assists <int>, def_tackles_for_loss <int>,
#> #   def_tackles_for_loss_yards <dbl>, def_fumbles_forced <int>,
#> #   def_sacks <dbl>, def_sack_yards <dbl>, def_qb_hits <dbl>,
#> #   def_interceptions <dbl>, def_interception_yards <dbl>, …

# Kicking

kickingStats_weekly %>% 
  group_by(player_id, season, week) %>% 
  filter(n() > 1)
#> # A tibble: 4 × 44
#> # Groups:   player_id, season, week [2]
#>   player_id  season  week season_type team  player_name player_display_name
#>   <chr>       <int> <int> <chr>       <chr> <chr>       <chr>              
#> 1 00-0004811   2000    11 REG         DEN   <NA>        Jason Elam         
#> 2 00-0004811   2000    11 REG         LV    <NA>        Jason Elam         
#> 3 00-0012875   2002     4 REG         PIT   <NA>        Todd Peterson      
#> 4 00-0012875   2002     4 REG         PIT   <NA>        Todd Peterson      
#> # ℹ 37 more variables: position <chr>, position_group <chr>,
#> #   headshot_url <chr>, fg_made <int>, fg_att <dbl>, fg_missed <int>,
#> #   fg_blocked <int>, fg_long <dbl>, fg_pct <dbl>, fg_made_0_19 <int>,
#> #   fg_made_20_29 <int>, fg_made_30_39 <int>, fg_made_40_49 <int>,
#> #   fg_made_50_59 <int>, fg_made_60_ <int>, fg_missed_0_19 <int>,
#> #   fg_missed_20_29 <int>, fg_missed_30_39 <int>, fg_missed_40_49 <int>,
#> #   fg_missed_50_59 <int>, fg_missed_60_ <int>, fg_made_list <chr>, …

sessionInfo()
#> R version 4.3.1 (2023-06-16 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 11 x64 (build 22631)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: America/Chicago
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.1.4    nflreadr_1.4.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.6.5       cli_3.6.3         knitr_1.48        rlang_1.1.4      
#>  [5] xfun_0.46         generics_0.1.3    data.table_1.15.4 glue_1.7.0       
#>  [9] htmltools_0.5.8.1 fansi_1.0.6       rmarkdown_2.27    evaluate_0.24.0  
#> [13] tibble_3.2.1      fastmap_1.2.0     yaml_2.3.10       lifecycle_1.0.4  
#> [17] memoise_2.0.1     compiler_4.3.1    fs_1.6.4          pkgconfig_2.0.3  
#> [21] rstudioapi_0.16.0 digest_0.6.36     R6_2.5.1          tidyselect_1.2.1 
#> [25] reprex_2.1.1      utf8_1.2.4        pillar_1.9.0      magrittr_2.0.3   
#> [29] tools_4.3.1       withr_3.0.0       cachem_1.1.0

Created on 2024-07-31 with reprex v2.1.1

Expected Behavior

I expect each player (i.e., player_id) to have only one row for a given season-week combination.
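
One way to make this expectation explicit before a merge is to check that the key is unique. The following is a minimal sketch, not part of nflreadr: `find_duplicate_keys()` is a hypothetical helper, and `toy_stats` is a made-up data frame standing in for the real player stats. It is written in base R so that it runs without any packages.

```r
# Hypothetical helper (not part of nflreadr): return every row whose
# key columns match at least one other row.
find_duplicate_keys <- function(data, keys) {
  key_data <- data[, keys, drop = FALSE]
  # duplicated() only flags the second and later occurrences, so scan from
  # both ends to flag every member of a duplicated group.
  is_dup <- duplicated(key_data) | duplicated(key_data, fromLast = TRUE)
  data[is_dup, , drop = FALSE]
}

# Toy data for illustration only; the IDs and values are made up.
toy_stats <- data.frame(
  player_id = c("p1", "p1", "p2", "p3"),
  season    = c(2000L, 2000L, 2000L, 2000L),
  week      = c(11L, 11L, 11L, 12L),
  team      = c("DEN", "LV", "PIT", "PIT"),
  stringsAsFactors = FALSE
)

dupes <- find_duplicate_keys(toy_stats, c("player_id", "season", "week"))

# TRUE only if every player_id-season-week combination occurs once.
key_is_unique <- nrow(dupes) == 0
```

In a merge script, something like `stopifnot(key_is_unique)` placed before the join would fail loudly instead of silently multiplying rows.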

nflverse_sitrep

> nflreadr::nflverse_sitrep()
── System Info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• R version 4.3.1 (2023-06-16 ucrt) • Running under: Windows 11 x64 (build 22631)
── Package Status ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   package installed  cran        dev behind
1   nfl4th     1.0.4 1.0.4 1.0.4.9002    dev
2 nflfastR     4.6.1 4.6.1 4.6.1.9010    dev
3 nflplotR     1.3.1 1.3.1      1.3.1       
4 nflreadr     1.4.1 1.4.1   1.4.1.00       
── Package Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• No options set for above packages
── Package Dependencies ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• askpass     (1.2.0)    • httr         (1.4.7)   • stringi     (1.8.4)       
• backports   (1.5.0)    • isoband      (0.2.7)   • stringr     (1.5.1)       
• base64enc   (0.1-3)    • janitor      (2.2.0)   • sys         (3.4.2)       
• bigD        (0.2.0)    • jquerylib    (0.1.4)   • tibble      (3.2.1)       
• bitops      (1.0-8)    • jsonlite     (1.8.8)   • tidyr       (1.3.1)       
• bslib       (0.8.0)    • juicyjuice   (0.1.0)   • tidyselect  (1.2.1)       
• cachem      (1.1.0)    • knitr        (1.48)    • timechange  (0.3.0)       
• cli         (3.6.3)    • labeling     (0.4.3)   • tinytex     (0.52)        
• colorspace  (2.1-1)    • lifecycle    (1.0.4)   • utf8        (1.2.4)       
• commonmark  (1.9.1)    • listenv      (0.9.1)   • V8          (4.4.2)       
• cpp11       (0.4.7)    • lubridate    (1.9.3)   • vctrs       (0.6.5)       
• curl        (5.2.1)    • magick       (2.8.4)   • viridisLite (0.4.2)       
• data.table  (1.15.4)   • magrittr     (2.0.3)   • withr       (3.0.0)       
• digest      (0.6.36)   • markdown     (1.13)    • xfun        (0.46)        
• dplyr       (1.1.4)    • Matrix       (1.6-5)   • xgboost     (1.7.8.1)     
• evaluate    (0.24.0)   • memoise      (2.0.1)   • xml2        (1.3.6)       
• fansi       (1.0.6)    • mime         (0.12)    • yaml        (2.3.10)      
• farver      (2.1.2)    • munsell      (0.5.1)   • codetools   (0.2-20)      
• fastmap     (1.2.0)    • openssl      (2.2.0)   • compiler    (4.3.1)       
• fastrmodels (1.0.2)    • parallelly   (1.38.0)  • graphics    (4.3.1)       
• fontawesome (0.5.2)    • pillar       (1.9.0)   • grDevices   (4.3.1)       
• fs          (1.6.4)    • pkgconfig    (2.0.3)   • grid        (4.3.1)       
• furrr       (0.3.1)    • progressr    (0.14.0)  • lattice     (0.22-6)      
• future      (1.34.0)   • purrr        (1.0.2)   • MASS        (7.3-60.0.1)  
• generics    (0.1.3)    • R6           (2.5.1)   • Matrix      (1.6-5)       
• ggpath      (1.0.1)    • rappdirs     (0.3.3)   • methods     (4.3.1)       
• ggplot2     (3.5.1)    • RColorBrewer (1.1-3)   • mgcv        (1.9-1)       
• globals     (0.16.3)   • Rcpp         (1.0.13)  • nlme        (3.1-165)     
• glue        (1.7.0)    • reactable    (0.4.4)   • parallel    (4.3.1)       
• gt          (0.11.0)   • reactR       (0.6.0)   • splines     (4.3.1)       
• gtable      (0.3.5)    • rlang        (1.1.4)   • stats       (4.3.1)       
• highr       (0.11)     • rmarkdown    (2.27)    • tools       (4.3.1)       
• hms         (1.1.3)    • sass         (0.4.9)   • utils       (4.3.1)       
• htmltools   (0.5.8.1)  • scales       (1.3.0)     
• htmlwidgets (1.6.4)    • snakecase    (0.11.1)    
── Not Installed ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• nflseedR ()
• nflverse ()

Relocating to nflfastR repo

Looking at the problematic defense data, it seems that in some cases players are attributed to the opposing team when they are credited with a fumble recovery or a penalty.

CORRECTION: I think we assign tackles made after turnovers to the wrong team.

So the main cause might be that an offensive player records a defensive stat after the offense has turned the ball over.
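
If this diagnosis is right, many duplicated keys should differ in the `team` column. A rough way to check is to count the distinct teams within each duplicated player_id-season-week key. The sketch below does this in base R on a made-up data frame; the column names mirror the defense stats in the reprex, but `toy_def` and all of its values are hypothetical.

```r
# Toy data for illustration only; IDs, teams, and values are made up.
toy_def <- data.frame(
  player_id   = c("p1", "p1", "p2", "p2", "p3"),
  season      = c(1999L, 1999L, 1999L, 1999L, 1999L),
  week        = c(4L, 4L, 12L, 12L, 4L),
  team        = c("MIN", "ARI", "ATL", "ATL", "SEA"),
  def_tackles = c(3L, 1L, 2L, 1L, 5L),
  stringsAsFactors = FALSE
)

key_cols <- c("player_id", "season", "week")

# Count rows and distinct teams per player_id-season-week key.
n_rows <- aggregate(team ~ player_id + season + week,
                    data = toy_def, FUN = length)
n_teams <- aggregate(team ~ player_id + season + week,
                     data = toy_def, FUN = function(x) length(unique(x)))
names(n_rows)[names(n_rows) == "team"] <- "n_rows"
names(n_teams)[names(n_teams) == "team"] <- "n_teams"

# Keep only the keys that occur more than once.
key_summary <- merge(n_rows, n_teams, by = key_cols)
dup_summary <- key_summary[key_summary$n_rows > 1, ]
```

Keys with `n_teams > 1` are consistent with a team-attribution problem. Keys with `n_teams == 1`, like the two Todd Peterson kicking rows in the reprex (both PIT), would need a different explanation.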

This might be quite hard to fix, and we should probably invest the time in #470 instead.

We will deprecate the calculate_player_stats_*() functions in a future release. The new function calculate_stats() (#470) will fix the issue.