Incomplete / weird game stats have snuck into this package...
Closed this issue · 3 comments
Issues detected with the games listed below. In some cases, the game data needs to be removed entirely from the {wnblr} package. In other cases, it may be data in specific columns that needs to be adjusted or removed but other stats from the same game can be retained as they appear correct.
season | page_id | competing teams | issue | action | status |
---|---|---|---|---|---|
2020 | 1777522 | PER vs. MEL | opposing team total player minutes are not equal, but other stats seem ok | correct team minutes in `box_scores`, consider removing player minutes from `box_scores_detailed`? | fixed in scraping, need to migrate into pkg |
2018 | 1087574 | ADL vs. MEL | missing 2nd half of game | remove this game from the package entirely | fixed in scraping, need to migrate into pkg |
2018 | 913531 | BEN vs. TSV | player minutes do not add up to an appropriate total, and opposing team minutes are not equal, but other stats seem ok | correct team minutes in `box_scores`, consider removing player minutes from `box_scores_detailed`? | fixed in scraping, need to migrate into pkg |
2017 | 803093 | PER vs. MEL | missing the last ~7 min of the 4th quarter | remove this game from the package entirely | fixed in scraping, need to migrate into pkg |
2015 | 270586 | PER vs. TSV | PER minutes are ~8 min lower than they should be for a 4-quarter game | correct team minutes in `box_scores`, consider removing player minutes from `box_scores_detailed`? | fixed in scraping, need to migrate into pkg |
2015 | 137264 | UCC vs. MEL | UCC minutes are ~10 min lower than they should be for a 4-quarter game | correct team minutes in `box_scores`, consider removing player minutes from `box_scores_detailed`? | fixed in scraping, need to migrate into pkg |
2014 | 64561 | UCC vs. WCW | UCC minutes are ~20 min lower than they should be for a 4-quarter game | correct team minutes in `box_scores`, consider removing player minutes from `box_scores_detailed`? | fixed in scraping, need to migrate into pkg |
2014 | 64586 | MEL vs. TSV | MEL minutes are ~15 min lower than they should be for a 4-quarter game | correct team minutes in `box_scores`, consider removing player minutes from `box_scores_detailed`? | fixed in scraping, need to migrate into pkg |
there's probably some general tidying up that can be done to the summed team minutes that are stored in `box_scores`. currently, summed team minutes are calculated from the total of player minutes within each team. this approach has been helpful for detecting errors, but it also yields team totals that don't make sense with 10-minute quarters.
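To illustrate how a summed team total comes out short when player minutes are incomplete, here is a minimal base R sketch. It assumes player minutes are stored as "MM:SS" strings (as the team totals appear to be); the helper names and the toy values are hypothetical and not part of the package:

```r
# Convert "MM:SS" strings to seconds (hypothetical helper, not part of wnblr)
mmss_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

# Convert seconds back to an "MM:SS" string
seconds_to_mmss <- function(s) {
  sprintf("%d:%02d", as.integer(s %/% 60), as.integer(s %% 60))
}

# Toy player rows for one team; the values are invented for illustration
players <- data.frame(
  team_name = rep("PER", 5),
  minutes   = c("40:00", "40:00", "40:00", "40:00", "32:00"),
  stringsAsFactors = FALSE
)

# Sum player seconds within each team, then format the total
team_seconds <- tapply(mmss_to_seconds(players$minutes), players$team_name, sum)
team_minutes <- seconds_to_mmss(team_seconds)
team_minutes
```

With these toy values the team total is 8 minutes short of the 200 expected for a 4-quarter game, which is the kind of discrepancy listed in the table.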
we can do a quick visual check for unique team minutes values:
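```r
library(dplyr)

# box_scores is the team box score data from the wnblr package
# keep one row per distinct team minutes value, newest games first
box_scores %>%
  distinct(minutes, .keep_all = TRUE) %>%
  select(season, page_id, team_name, team_name_opp, minutes) %>%
  arrange(desc(season), desc(page_id)) %>%
  View()
```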
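Alongside the visual check, mismatched opponent totals (the problem noted for several games in the table above) could also be flagged programmatically. A minimal base R sketch, assuming two rows per game keyed by `page_id`; the toy data frame and its values are invented stand-ins for `box_scores`:

```r
# Toy stand-in for box_scores: two rows per game, one per team (values invented)
toy <- data.frame(
  page_id   = c(1, 1, 2, 2),
  team_name = c("AAA", "BBB", "AAA", "CCC"),
  minutes   = c("200:00", "200:00", "200:00", "192:00"),
  stringsAsFactors = FALSE
)

# Count distinct minutes values within each game; more than one means a mismatch
n_values <- tapply(toy$minutes, toy$page_id, function(m) length(unique(m)))
mismatched_ids <- names(n_values)[n_values > 1]

# Show both team rows for every mismatched game
toy[as.character(toy$page_id) %in% mismatched_ids, ]
```

This only catches games where the two teams disagree; a game where both totals are wrong by the same amount would still need the check on unique values.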
excluding cases listed above that reflect incomplete records from FIBA LiveStats, i think we can safely group and revalue team minutes in `box_scores` as follows:
- if the game was played out in 4 quarters of regular time, then `minutes == "200:00"`
- if the game went to OT1, then `minutes == "225:00"`
- if the game went to OT2, then `minutes == "250:00"`
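The three values follow from 5 players on court: 5 x 40 min of regular time, plus 5 x 5 min for each overtime period. A minimal sketch of that rule in base R; `expected_team_minutes()` and its `ot_periods` argument are hypothetical names, and how the number of overtime periods is actually determined in the scraping scripts is not shown here:

```r
# Expected team minutes: 5 players x 10 min x 4 quarters = 200,
# plus 5 players x 5 min = 25 for each overtime period.
# `ot_periods` is a hypothetical input (0 = regulation, 1 = OT1, 2 = OT2).
expected_team_minutes <- function(ot_periods) {
  stopifnot(all(ot_periods >= 0))
  sprintf("%d:00", as.integer(200 + 25 * ot_periods))
}

expected_team_minutes(0:2)
# "200:00" "225:00" "250:00"
```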
Making good progress - still need to:
- Refactor 2019 scraping scripts to pull page_ids from the appropriate data frame
- Revise 2019 team box scores scraping script with simpler determination of total team minutes
- Rescrape 2019 team box scores data
- Refactor 2020 scraping scripts to pull page_ids from the appropriate data frame
- Revise 2020 team box scores scraping script with simpler determination of total team minutes
- Rescrape 2020 team box scores data
- Revise 2020 detailed team box scores scraping script to omit player minutes from page_id noted above
- Rescrape 2020 detailed team box scores data
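Omitting player minutes for a flagged game could look roughly like the sketch below. It uses base R and an invented stand-in for `box_scores_detailed`; the `player` column and all minutes values are made up, and the list of ids simply collects the games in the table above where only the minutes look wrong (the 2020 scripts would only need 1777522):

```r
# page_ids from the table above where player minutes look unreliable
flagged_ids <- c(1777522, 913531, 270586, 137264, 64561, 64586)

# Toy stand-in for box_scores_detailed (player names and minutes are invented)
toy_detailed <- data.frame(
  page_id = c(1777522, 1777522, 999999),
  player  = c("Player A", "Player B", "Player C"),
  minutes = c("31:12", "28:40", "35:05"),
  stringsAsFactors = FALSE
)

# Blank out minutes for flagged games only; other columns stay untouched
toy_detailed$minutes[toy_detailed$page_id %in% flagged_ids] <- NA_character_
toy_detailed
```

Setting the values to `NA` rather than dropping the rows keeps the other stats from the same game, in line with the note at the top of the issue.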
That got a little complicated... I think I'm done rescraping the tidied-up data for this issue.
In addressing the restructuring of the package (#32), the next release of the `wnblr` package will draw on updated data. Keeping this issue open since #32 is not done yet, but I don't think any further work specific to this issue is required for now.