bigballR
is an R package for working with NCAA Basketball data. This
package primarily revolves around the use of schedule, roster, and play-by-play data via stats.ncaa.com, and
additionally has features to calculate lineups, on/off results, and player game and multi-game statistics.
First install the package devtools
if you haven't already
#install.packages("devtools")
devtools::install_github("jflancer/bigballR")
Retrieving Game IDs and Other Information
Manually, game ids can be found in the url when browsing games, for example:
4674164
is the game id for https://stats.ncaa.org/game/play_by_play/4674164
get_date_games
data frame with game info and game ids for a given date.get_team_schedule
data frame with game info and ids for a given team.get_team_roster
data frame with player roster information for a given team.
Game Scraping Functions
scrape_game
the base function, given a game id will return a cleaned play by play file.get_play_by_play
builds off of scrape_game, allows for compiling from multiple game ids formatted in a vector.
Data Manipulation Functions
get_lineups
given play-by-play data retrieved from game scraping functions, will calculate a variety of stats for all lineups.on_off_generator
calculate on/off statistics for specified players given lineup data from get_lineups().get_player_stats
calculate a variety of player stats from play-by-play data at either a game or multi-game level.get_player_lineups
acts as an easy way to filter lineup data, specifying players to exclude/include from the lineups.
Datasets
teamids
dataset includes stats.ncaa team name, team conference, season, and team id for 17-18 and 18-19 seasons
There are many different progressions and ways to use this package. As an example, here are some natural steps you could take.
# Get team schedule
# Note: if you don't know the proper team.name (case sensitive), you can look it up in data("teamids")
schedule <- get_team_schedule(season = "2018-19", team.name = "Duke")
# Get play by play for all games played so far in season
play_by_play <- get_play_by_play(schedule$Game_ID)
# Generate all lineups and stats from the play by play
lineups <- get_lineups(play_by_play_data = play_by_play, keep.dirty = T, garbage.filter = F)
# Look at Zion Williamson's on/off statistics with lineups that include Reddish and Barrett
zion_comparison <- on_off_generator("ZION.WILLIAMSON", lineups, Included = c("CAM.REDDISH","RJ.BARRETT"))
Functions to retrieve play by play data. scrape_game() works for individual games while get_play_by_play can handle a vector of gameids and will aggregate into a single dataframe. Warns users of potential errors and mistakes made by the game trackers. The number of player discrepancies warning counts displays the number of events players committed when it is found they were not on the court at the time of the event. The substitution mistake warning indicates an unclean substitution was entered. (ex. 2 players enter and 1 leaves)
- game_id | string made up of digits given to each unique game. This can be found in the play-by-play url for each game.
- game_ids | same function as above, but can handle character vectors of any length.
This function returns a schedule for the given date and specified conference. Results are included if applicable, as well as the play-by-play game id
- date | a character object containing a date in the format mm/dd/yyyy. Defaults is previous day (yesterday)
- conference | the common name used for a conference, not sensitive to case, spacing, punctuation, etc.
- conference.ID | alternatively, if the conference ID is known it replace the conference name variable.
This function returns a data frame of the schedule for the specified team. This will include game ids used for play-by-play scraping if the game has ended, along with the team scores and attendance. Note: currently, the season/team.name parameters can only be used for the 2016-17, 2017-18, 2018-19 seasons.
- team.id | The unique id given to each college/team for each season. This can be found in the url of the team page.
- season | Season following format yyy1-y2, ex "2018=19"
- team.name | Alternative to using the id, you can use a team name. This follows the format found in data(ids).
This function returns a data frame of the roster for the specified team. This will include player names and positions as well as jersey number, height and school year. Note: currently, the season/team.name parameters can only be used for the 2016-17, 2017-18, 2018-19 seasons.
- This takes the same parameters and conditions as get_team_schedule() above
This function takes in a play-by-play dataframe, and generates all possible lineups for both teams. It then calculates a variety of statistics/metrics at a lineup level.
- play_by_play_data | play-by-play data from the functions scrape_game() or get_play_by_play()
- keep.dirty | logical to specify whether or not to filter out potentially inaccurrate data. When FALSE, will remove all rows from games where the number of discrepencies is above the desired count.
- garbage.filter | logical variable to specify whether or not to filter out garbage time entries.
- error.thresh | lets user set their preferred discrepancy threshold with the keep.dirty variable. This means when less than the threshold occurs in a game, it will be considered clean. As defined in scrape_game(), a discrepancy occurs when a player registers an event when they are not found to be on the court.
This function passes in lineup data and calculates the on/off lineup statistics for all lineup combinations of players specified. This allows users to view on/off statistics for individual players, as well as combinations of multiple players. Users can also specify if they'd like specific players to be included or excluded from all lineups in use
- Players | players desired to be compared with on/off
- Lineup_Data | lineups collected from the get_lineups() function
- Included These Players will be on the court for every lineup considered.
- Excluded These players will be off the court for every lineup considered.
This function finds all lineups from a given lineup data source that include/exclude certain players. It acts as a quick way to filter lineups for players.
- Takes same parameters as on_off_generator() with exception to the Players parameter
This function calculates many player stats for either individual games or aggregate to get multi-game stats.
- Takes all parameters described in get_lineups() function
- multi.games | When false stats will be calculated on a game level. When true all games will be aggregated.