The goal of this project is to build a series of statistical models on NBA statistics, and ultimately a website that predicts the odds for each major award and for All-Star selections. This includes developing a comprehensive scraper that efficiently pulls data from Basketball Reference while respecting the website's Use of Data policies. The scraper is built in Python, and the data analysis is done in R.
This project also aims to eventually combine game statistics with social media data. There is no doubt that All-Star voting and awards selection are heavily influenced by social media trends and media narratives, and I suspect accounting for this would improve the models. The first step will be to include Twitter data, pending a developer account application.
An article on this project has been published by Towards Data Science. This article can be found here.
At this stage, usage of the project is fairly self-explanatory. Use scraper.py for the scraping functionality, or loadData.R and MVPModels.R for loading the data and fitting the MVP models. Prerequisites and libraries for each are listed below or within the respective files.
- Scraper for MVP data, season player total statistics, season player per game statistics, season player advanced statistics, and team standings
- Models predicting MVP outcome using MVP data and total and advanced player statistics
- Leave-one-out cross validation system to quickly evaluate new models
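The leave-one-out idea can be sketched in a few lines. The project's own models live in R, so the Python below is only an illustration under stated assumptions: it holds out one season at a time, fits a one-feature least-squares line on the remaining seasons, and checks whether the top-ranked player in the held-out season matches the actual vote-share leader. The field names (`season`, `player`, `feature`, `share`), the single-feature model, and the toy data are all hypothetical and do not come from the repository.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return y_bar - slope * x_bar, slope


def leave_one_season_out(rows):
    """Return the fraction of held-out seasons where the model's top-ranked
    player is also the actual vote-share leader.

    rows: list of dicts with hypothetical keys
          'season', 'player', 'feature', 'share'.
    """
    seasons = sorted({r["season"] for r in rows})
    hits = 0
    for held_out in seasons:
        train = [r for r in rows if r["season"] != held_out]
        test = [r for r in rows if r["season"] == held_out]
        intercept, slope = fit_line(
            [r["feature"] for r in train],
            [r["share"] for r in train],
        )
        predicted = max(test, key=lambda r: intercept + slope * r["feature"])
        actual = max(test, key=lambda r: r["share"])
        hits += predicted["player"] == actual["player"]
    return hits / len(seasons)


# Made-up toy data, not real voting results.
toy = [
    {"season": 2001, "player": "A", "feature": 12.0, "share": 0.90},
    {"season": 2001, "player": "B", "feature": 9.0, "share": 0.40},
    {"season": 2002, "player": "C", "feature": 8.0, "share": 0.30},
    {"season": 2002, "player": "D", "feature": 11.0, "share": 0.80},
    {"season": 2003, "player": "E", "feature": 10.0, "share": 0.35},
    {"season": 2003, "player": "F", "feature": 9.5, "share": 0.70},
]
loo_accuracy = leave_one_season_out(toy)
print(f"Seasons predicted correctly: {loo_accuracy:.2f}")
```

Holding out whole seasons rather than individual players keeps every candidate from the evaluated season out of the training data, which seems the natural unit when the question is "who wins this year".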
Scraper
- Add scraping for advanced statistics
- Add scraping for other awards
MVP Models
- Model using per game statistics
- Accounting for shortened seasons
- Accounting for reverse recency bias
- Adding feature to analyze player improvement relative to past season
- Add advanced stats to models
- Account for post All-Star break numbers
Long-term
- Create models for other awards
- Add social media data
- Merge data into a single dataset and build a system to query that dataset
- Deploy models into live-updating website
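One possible way to approach the "shortened seasons" item above is to prorate season totals to a full 82-game schedule before modeling. This is only a sketch of that idea, not the project's implementation: the function name is hypothetical, seasons are assumed to be keyed by their ending year, and the mapping covers only the two lockout-shortened seasons (50 games in 1998-99, 66 games in 2011-12) rather than being exhaustive.

```python
FULL_SEASON_GAMES = 82

# Games scheduled per team in the lockout-shortened seasons,
# keyed by the year the season ended (an assumed convention).
SHORTENED_SEASONS = {1999: 50, 2012: 66}


def scale_to_full_season(total, season_end_year):
    """Prorate a season total (points, rebounds, ...) to an 82-game pace."""
    games = SHORTENED_SEASONS.get(season_end_year, FULL_SEASON_GAMES)
    return total * FULL_SEASON_GAMES / games


# A 500-point total in the 50-game season, expressed at an 82-game pace.
print(scale_to_full_season(500, 1999))
```

Per game and rate-based advanced statistics would not need this adjustment; it only matters for counting totals.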
To use this code, you'll need a Python installation, along with Selenium and the appropriate browser driver for the web crawler. You'll also need an R installation; I recommend RStudio as the IDE. All of this is free to use, with links provided below.
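To give a sense of how these prerequisites fit together, here is a minimal sketch of a Selenium-driven page fetch. It is not the code in scraper.py: the URL pattern, the function names, the choice of Firefox, and the five-second pause between requests are all assumptions made for illustration, so check them against the site and its data-use policy before relying on them.

```python
import time

# Assumed URL pattern for a season's award-voting page; verify before use.
AWARDS_URL = "https://www.basketball-reference.com/awards/awards_{year}.html"


def season_urls(first_year, last_year):
    """Build one awards-page URL per season, inclusive of both ends."""
    return [AWARDS_URL.format(year=y) for y in range(first_year, last_year + 1)]


def fetch_pages(urls, delay_seconds=5.0):
    """Load each URL in a browser and return a {url: html} dict.

    Selenium is imported inside the function so that the URL helper
    above can be used without Selenium installed.
    """
    from selenium import webdriver

    # Requires a matching browser driver; webdriver.Chrome() is the
    # equivalent call for Chrome.
    driver = webdriver.Firefox()
    pages = {}
    try:
        for url in urls:
            driver.get(url)
            pages[url] = driver.page_source
            # Pause between requests to limit load on the site.
            time.sleep(delay_seconds)
    finally:
        driver.quit()  # always close the browser, even after an error
    return pages


if __name__ == "__main__":
    html_by_url = fetch_pages(season_urls(2018, 2019))
    print(f"Fetched {len(html_by_url)} pages")
```

The returned HTML can then be parsed into tables and written to disk for the R scripts to load.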
The goal for this project is to ultimately deploy it live onto a website, but the project is far from that point. Once the project is deployed, this README will be updated to include deployment instructions.
- Atom - Text editor used
- Hydrogen - Interactive coding environment for Atom
- RStudio - IDE for R
- Selenium - Framework for web crawling
Any contributions are welcome, as long as the code is functional and at least some effort is made to document it.
- Caio Brighenti - Developer - CaioBrighenti
This project is licensed under the MIT License; see the LICENSE.md file for details.
- Stack Overflow - For obvious reasons
- Harvard Sports Analytics - For inspiration and a baseline comparison