- Create a model that predicts a pass better than the baseline
- Identify key drivers for an offensive play resulting in a pass
- Incorporate clustering to refine model
- Construct a machine learning regression classification model
- Deliver a report for the Defensive Coordinator with findings
The NFL has been one of the largest organizations to introduce analytics into its multi-faceted operation and has made available very useful information that can help in better understanding the ever-changing game of American football.
This project is a scenario that a Defensive Coordinator has asked for a report on how offensive mindsets have changed over the last 5 seasons and whether or not available data can be used in order to better predict offensive play outcomes.
In order to provide an idea of the impact NFL data has, I will be taking play-by-play data from the past 5 seasons (2017-2021) and using it in order to predict whether an offensive play will be will result in a pass or a run.
- Has passing increased over the years?
- Is there a relationship between what down it is and passing the ball?
- Is there a linear relationship between current quarter and passing the ball?
- Has offensive produced yards increased over time?
Column | Description | Data Type |
---|---|---|
GameDate | Date game was played | object |
Quarter | Quarter the play happened | int64 |
Minute | Minute the play happened | int64 |
Second | Second the play happened | int64 |
OffenseTeam | Team playing offense | object |
DefenseTeam | Team playing defense | object |
Down | What down the play occured on | int64 |
ToGo | Yards until a 1st down | int64 |
YardLine | Yards needed to score a touchdown | int64 |
SeriesFirstDown | Whether play is first of drive | int64 |
SeasonYear | Season that play occured | int64 |
Yards | How many yards gain on play | int64 |
Formation | Formation of offense | object |
PlayType | Type of play performed | object |
IsPass | Was the play a pass | int64 |
IsInterception | Was the play an interception | int64 |
IsFumble | Was the play a fumble | int64 |
YardLineFixed | Yardline the play started on | int64 |
YardLineDirection | Side of field play started | object |
YTG_bins | Yards to go bins used | category |
QuarterSeconds | Maximum seconds up to current quarter | int64 |
ClockSeconds | Total seconds on the game clock | int64 |
SecondsLeft | Total seconds left in the game | int64 |
The following outlines the process taken through the Data Science Pipeline to complete this project.
Plan ➜ Acquire ➜ Prepare ➜ Explore ➜ Model & Evaluate ➜ Deliver
- Define the project goal
- Determine proper format for the audience (Defensive Coordinator)
- Asked questions that would lead to final goal
- Define an MVP
- Create a function to pull appropriate information from the play-by-play csv files for 2017-2021 seasons
- Create and save a wrangle.py file in order to use the function to acquire
- Ensure all data types are usable
- Create a function that will:
- drop unneeded columns
- get rid of kick formation rows
- get rid of other plays other than pass and run
- create a 'SecondsLeft' column
- Add a function that splits the acquired data into Train, Validate, and Test sets
- 20% is originally pulled out in order to test in the end
- From the remaining 80%, 30% is pullout out to validate training
- The remaining data is used as testing data
- In the end, there should be a 56% Train, 24% Validate, and 20% Test split
- Create an exploratory workbook
- Create initial questions to help explore the data further
- Make visualizations to help identify and understand key driver
- Create clusters in order to dive deeper and refine features
- Use stats testing on established hypotheses
- Use clusters to evaluate drivers of offensive pass plays
- Create a baseline
- Make predictions of models and what they say about the data
- Compare all models to evaluate the best for use
- Use the best performing model on the test (unseen data) sample
- Compare the modeled test versus the baseline
- Present a final Jupyter Notebook that utilizes a written report format
- Make modules used and project files available on Github
-
Have your env file with proper credentials saved to the working directory
-
Ensure that a .gitignore is properly made in order to keep privileged information private
-
Clone repo from github to ensure availability of the acquire and prepare imports
-
Ensure pandas, numpy, matplotlib, scipy, sklearn, and seaborn are available
-
Follow steps outline in this README.md to run final_report_pass_run.ipynb
The goals of this project were to identify drivers of pass plays in order to predict whether or not a play would result in a pass or run for NFL offenses. Key drivers found were the following:
- Down
- Yardline
- togo_cluster (Yards to go along with seconds left in the game) Using these drivers to help our model resulted in an increase of 2.74% over the baseline.
An increase of 2.74% isn't impactful enough to push the model forward to help with predictions in future or real time scenarios. The model must be further refined.
With more time, I would like to:
-
Work on more feature engineering and explore more relationships of categories to passing and/or running the ball.
-
Explore other datasets to find and create features that will help refine our current model or result in the creation of a new model.
-
Look into focusing on plays that happened on specific down and by specific teams.