jacquietran/wnblr

Incomplete / weird game stats have snuck into this package...

Closed this issue · 3 comments

Issues detected with the games listed below. In some cases, the game data needs to be removed entirely from the {wnblr} package. In other cases, it may be data in specific columns that needs to be adjusted or removed but other stats from the same game can be retained as they appear correct.

| season | page_id | competing teams | issue | action | status |
|---|---|---|---|---|---|
| 2020 | 1777522 | PER vs. MEL | opposing team total player minutes are not equal, but other stats seem ok | correct team minutes in box_scores, consider removing player minutes from box_scores_detailed? | fixed in scraping, need to migrate into pkg |
| 2018 | 1087574 | ADL vs. MEL | missing 2nd half of game | remove this game from the package entirely | fixed in scraping, need to migrate into pkg |
| 2018 | 913531 | BEN vs. TSV | player minutes do not add up to an appropriate total, and opposing team minutes are not equal, but other stats seem ok | correct team minutes in box_scores, consider removing player minutes from box_scores_detailed? | fixed in scraping, need to migrate into pkg |
| 2017 | 803093 | PER vs. MEL | missing the last ~7 min of the 4th quarter | remove this game from the package entirely | fixed in scraping, need to migrate into pkg |
| 2015 | 270586 | PER vs. TSV | PER minutes are ~8 min lower than they should be for a 4-quarter game | correct team minutes in box_scores, consider removing player minutes from box_scores_detailed? | fixed in scraping, need to migrate into pkg |
| 2015 | 137264 | UCC vs. MEL | UCC minutes are ~10 min lower than they should be for a 4-quarter game | correct team minutes in box_scores, consider removing player minutes from box_scores_detailed? | fixed in scraping, need to migrate into pkg |
| 2014 | 64561 | UCC vs. WCW | UCC minutes are ~20 min lower than they should be for a 4-quarter game | correct team minutes in box_scores, consider removing player minutes from box_scores_detailed? | fixed in scraping, need to migrate into pkg |
| 2014 | 64586 | MEL vs. TSV | MEL minutes are ~15 min lower than they should be for a 4-quarter game | correct team minutes in box_scores, consider removing player minutes from box_scores_detailed? | fixed in scraping, need to migrate into pkg |

there's probably some general tidying up that can be done to the summed team minutes stored in box_scores. currently, summed team minutes are calculated as the total of player minutes within each team. this approach has been helpful for detecting errors, but it also produces team totals that don't make sense with 10-minute quarters.
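as a rough illustration of that summing approach, here is a minimal sketch on made-up data. the data frame, its column names, and the `to_seconds()` helper are all hypothetical and are not taken from the package; each team has only two players to keep the example short.

```r
library(dplyr)

# Hypothetical player-level rows (made-up values, two players per team for brevity)
players <- data.frame(
  page_id   = c(1, 1, 1, 1, 2, 2, 2, 2),
  team_name = c("AAA", "AAA", "BBB", "BBB", "AAA", "AAA", "CCC", "CCC"),
  minutes   = c("20:00", "20:00", "20:00", "20:00",
                "20:00", "20:00", "20:00", "12:30"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert "MM:SS" strings to seconds
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

# Sum player minutes within each team for each game
team_totals <- players %>%
  group_by(page_id, team_name) %>%
  summarise(team_seconds = sum(to_seconds(minutes)), .groups = "drop")

# Flag games where the two teams' summed totals are not equal
flagged <- team_totals %>%
  group_by(page_id) %>%
  filter(n_distinct(team_seconds) > 1) %>%
  ungroup()
```

in this toy data only the second game is flagged, because one team's player minutes sum to less than the other's.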

we can do a quick visual check for unique team minutes values:

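library(dplyr)
library(wnblr)  # provides the box_scores data frame

# keep one example game per distinct team minutes value, newest first
box_scores %>%
  distinct(minutes, .keep_all = TRUE) %>%
  select(season, page_id, team_name, team_name_opp, minutes) %>%
  arrange(desc(season), desc(page_id)) %>%
  View()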

excluding the cases listed above, which reflect incomplete records from FIBA LiveStats, i think we can safely group and revalue team minutes in box_scores as follows:

  • if the game was played out in 4 quarters of regular time, then minutes == "200:00"
  • if the game went to OT1, then minutes == "225:00"
  • if the game went to OT2, then minutes == "250:00"
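
the revaluing rules above could be sketched roughly as follows. this assumes a hypothetical `n_overtimes` column giving the number of overtime periods played; that column and the toy values are illustrative only, and the actual scraping scripts may determine overtime differently.

```r
library(dplyr)

# Toy team-level rows; `n_overtimes` is a hypothetical column used for illustration
games <- data.frame(
  page_id     = c(101, 102, 103),
  n_overtimes = c(0, 1, 2),
  minutes     = c("199:58", "225:03", "248:40"),
  stringsAsFactors = FALSE
)

# Replace summed team minutes with the expected total for the game length
revalued <- games %>%
  mutate(minutes = case_when(
    n_overtimes == 0 ~ "200:00",
    n_overtimes == 1 ~ "225:00",
    n_overtimes == 2 ~ "250:00",
    TRUE             ~ NA_character_  # anything else becomes NA for manual review
  ))
```

the target values are consistent with five players on court for 40 minutes of regulation time, plus 25 team minutes for each 5-minute overtime period.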

Making good progress - still need to:

  • Refactor 2019 scraping scripts to pull page_ids from the appropriate data frame
  • Revise 2019 team box scores scraping script with simpler determination of total team minutes
  • Rescrape 2019 team box scores data
  • Refactor 2020 scraping scripts to pull page_ids from the appropriate data frame
  • Revise 2020 team box scores scraping script with simpler determination of total team minutes
  • Rescrape 2020 team box scores data
  • Revise 2020 detailed team box scores scraping script to omit player minutes from page_id noted above
  • Rescrape 2020 detailed team box scores data

That got a little complicated... I think I'm done with rescraping the tidied-up data in relation to this issue.

In addressing the restructuring of the package (#32), the next release of the wnblr package will draw on updated data. Keeping this issue open for now since #32 is not done yet, but I don't think there is any further work required specific to this issue for now.