Random Football

Predicting NFL Game Winners using Random Forests

Flatiron mod 5 project by Llew and Jon

Question: Can we predict the winner of NFL games?

In general: not very well. Football games are decided by many random factors, with individual plays having a significant impact on the result. The outcomes of those plays are often determined by slim margins (for example, did the ball cross the line to gain, or did the player's shoe touch the out-of-bounds line?).

We revised our goal to: can we beat Vegas in picking the winner? Vegas oddsmakers imply a favorite through the point spread: if a team's point spread is negative, that team is the favorite.
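
As a minimal sketch of how the Vegas pick can be read off the spread (the column names here are illustrative, not the Kaggle dataset's actual schema):

```python
import pandas as pd

# Illustrative frame: 'spread_home' is the spread from the home team's
# perspective, so a negative value means the home team is favored.
games = pd.DataFrame({
    "team_home": ["KC", "NE"],
    "team_away": ["DEN", "NYJ"],
    "spread_home": [-7.0, 3.5],
})

# The Vegas pick is whichever team carries the negative spread.
games["vegas_pick"] = games.apply(
    lambda row: row["team_home"] if row["spread_home"] < 0 else row["team_away"],
    axis=1,
)
print(games)
```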

Libraries Used

  • pandas
  • numpy
  • patsy
  • sklearn

Data

NFL Scores and Stadium Data

https://www.kaggle.com/tobycrabtree/nfl-scores-and-betting-data

  • Game information going back to 1966, including scores, weather, betting spreads, and over/under lines

FiveThirtyEight Elo Data

https://projects.fivethirtyeight.com/nfl-api/nfl_elo.csv

  • FiveThirtyEight Elo ratings for each game back to 1920 (see the loading sketch below)
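
A minimal sketch of loading and joining the two sources. The file name 'spreadspoke_scores.csv' and the column names below are assumptions about the exports, not verified schemas, and the two sources format team names differently, so a real join needs a name-mapping step first:

```python
import pandas as pd

# Load the Kaggle scores/betting file and the FiveThirtyEight Elo feed.
scores = pd.read_csv("spreadspoke_scores.csv", parse_dates=["schedule_date"])
elo = pd.read_csv(
    "https://projects.fivethirtyeight.com/nfl-api/nfl_elo.csv",
    parse_dates=["date"],
)

# Join on date + home team so each game picks up its Elo win probability.
# In practice the sources spell team names differently, so a mapping
# table is needed before this merge will line up.
merged = scores.merge(
    elo[["date", "team1", "elo_prob1"]],
    left_on=["schedule_date", "team_home"],
    right_on=["date", "team1"],
    how="inner",
)
```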

Features gathered from the Kaggle dataset included:

  • Season (Year)
  • Week of Season
  • Elevation
  • Home Team
  • Away Team
  • Weather Conditions

From the FiveThirtyEight Elo Data:

  • Elo probability of home team win

Additional features were engineered from the Kaggle data. The full dataset was rearranged for each team to calculate the following (a sketch follows the list):

  • Season Win Percentage prior to the current game
  • Season Average Points Scored prior to the current game
  • Distance Travelled since last game
  • Days Elapsed since last game
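
A sketch of how these rolling features can be computed, assuming a long-format frame with one row per team per game. All column names here ('team', 'season', 'date', 'won' as 0/1, 'points', stadium 'lat'/'lon') are illustrative:

```python
import numpy as np
import pandas as pd

def add_rolling_features(team_games: pd.DataFrame) -> pd.DataFrame:
    df = team_games.sort_values(["team", "season", "date"]).copy()
    grp = df.groupby(["team", "season"])

    # Shift by one game so each row sees only games *before* the current one.
    df["win_pct_prior"] = grp["won"].transform(
        lambda s: s.shift(1).expanding().mean())
    df["avg_points_prior"] = grp["points"].transform(
        lambda s: s.shift(1).expanding().mean())

    # Days elapsed since the previous game.
    df["days_since_last"] = grp["date"].diff().dt.days

    # Haversine distance (miles) from the previous game's stadium.
    lat1, lon1 = np.radians(df["lat"]), np.radians(df["lon"])
    lat2 = np.radians(grp["lat"].shift(1))
    lon2 = np.radians(grp["lon"].shift(1))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    df["travel_miles"] = 3959 * 2 * np.arcsin(np.sqrt(a))
    return df
```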

Modeling

We selected the 2017 NFL season as our initial test data, which gave us a sizeable history of training data while still allowing us to move the model forward in time and test on the 2018 NFL season.

After initial attempts to model using logistic regression, support vector machines, gradient boosting, and single decision trees, we found the most successful model was a random forest regressor. Regressing on a 0/1 home-win target yields a continuous score that can be thresholded at 0.5, which also lets us flag close calls.

Random forests do not require feature scaling, so our features remained in their original units.

We used a grid search to find the best-fitting hyperparameters for the random forest. However, the grid search produced different results on each run because of the randomness inherent in random forests (bootstrap sampling and random feature selection), so we fixed an arbitrary random state to pin down a single set of parameters.
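
A minimal sketch of the search, with placeholder data standing in for the engineered features and an illustrative grid (not the exact grid we ran):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the features and 0/1 home-win label.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 10))
y_train = rng.integers(0, 2, size=500).astype(float)

# Illustrative grid, not the exact one we ran.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "max_features": ["sqrt", 0.5],
    "min_samples_leaf": [1, 4],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),  # fixed state pins one answer
    param_grid,
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)
```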

The key parameters of the random forest regressor (applied in the sketch following this list) include:

  • 'n_estimators': 50
    • Number of decision trees in the forest
  • 'bootstrap': True
    • Bootstrap samples are used to build the decision trees
  • 'max_depth': 10
    • The maximum depth of each tree is 10
  • 'max_features': 'sqrt'
    • The number of features considered when looking for the best split is limited to sqrt(n_features)
  • 'min_samples_leaf': 4
    • The minimum number of samples required at a leaf node is 4
  • 'min_samples_split': 2
    • The minimum number of samples required to split an internal node is 2
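
Putting those values together, a sketch of fitting the final model, thresholding its continuous output into a pick, and reading off feature importances (data and feature names are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder data and feature names standing in for the real frame.
rng = np.random.default_rng(1)
feature_names = [f"feature_{i}" for i in range(10)]
X_train = rng.normal(size=(500, 10))
y_train = rng.integers(0, 2, size=500).astype(float)
X_test = rng.normal(size=(100, 10))

model = RandomForestRegressor(
    n_estimators=50,
    bootstrap=True,
    max_depth=10,
    max_features="sqrt",
    min_samples_leaf=4,
    min_samples_split=2,
    random_state=42,  # assumed; fixed for reproducibility
)
model.fit(X_train, y_train)

scores = model.predict(X_test)   # continuous home-win score
home_win_pred = scores >= 0.5    # threshold into a binary pick

# Rank features by importance.
for name, imp in sorted(zip(feature_names, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.1%}")
```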

Our model identified the following features as the most important, with their associated feature importance percentages:

  1. Elo Probability Home - 22%
  2. Away Team Average Score - 9%
  3. Home Team Season Win % - 8%
  4. Away Team Season Win % - 7%
  5. Away Team Travel Distance - 7%

The model was run on each season in our dataset, beginning with the 2012 NFL season (which trained only on the 2011 season). We limited the training data to at most four prior seasons, based both on the decreasing accuracy we observed with longer histories and on the intuition that NFL teams turn over their players and/or coaching staff within that span.
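
A sketch of that season-by-season loop with a sliding window of at most four prior training seasons (the frame below is synthetic; FEATURES and column names are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real feature frame.
rng = np.random.default_rng(2)
FEATURES = ["elo_prob_home", "home_win_pct_prior", "away_avg_points_prior"]
games = pd.DataFrame(rng.normal(size=(2000, 3)), columns=FEATURES)
games["season"] = rng.integers(2011, 2019, size=2000)
games["home_win"] = rng.integers(0, 2, size=2000)

results = {}
for test_season in range(2012, 2019):
    # Train on at most the four seasons preceding the test season.
    train = games[games["season"].between(test_season - 4, test_season - 1)]
    test = games[games["season"] == test_season]

    model = RandomForestRegressor(
        n_estimators=50, max_depth=10, max_features="sqrt",
        min_samples_leaf=4, min_samples_split=2, random_state=42)
    model.fit(train[FEATURES], train["home_win"])

    picks = model.predict(test[FEATURES]) >= 0.5
    results[test_season] = (picks == test["home_win"]).mean()

print(results)
```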

Results

| Test Season | Training Season(s) | Baseline All-Home Accuracy | Target Vegas Accuracy | Random Forest Accuracy | Close-call Home Accuracy | Close-call Abstain Accuracy |
| ----------- | ------------------ | -------------------------- | --------------------- | ---------------------- | ------------------------ | --------------------------- |
| 2012 | 2011 | 56.55% | 63.67% | 62.17% | 61.42% | 69.43% |
| 2013 | 2011-2012 | 59.18% | 64.42% | 60.30% | 58.43% | 69.60% |
| 2014 | 2011-2013 | 57.30% | 68.54% | 66.29% | 62.55% | 70.00% |
| 2015 | 2011-2014 | 54.31% | 64.04% | 63.30% | 59.18% | 70.27% |
| 2016 | 2012-2015 | 57.68% | 67.04% | 64.04% | 58.80% | 66.87% |
| 2017 | 2013-2016 | 56.93% | 60.30% | 64.04% | 62.17% | 68.39% |
| 2018 | 2014-2017 | 58.80% | 67.42% | 62.17% | 63.67% | 64.33% |

Conclusion

  • Random Forest model accuracy (62%-66%) regularly beats a naive, home-team-wins strategy (54%-59%).
  • With Elo probabilities included and a strategy of not betting on close calls, Random Forest Regressor model accuracy (64%-70%) beat out the Vegas odds (60%-68%). A sketch of that abstain rule follows.
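
A sketch of the close-call abstain strategy: only grade games where the model's continuous score is far enough from 0.5. The 0.1 band here is an illustrative cutoff, not necessarily the one we used:

```python
import numpy as np

def abstain_accuracy(scores, home_won, band=0.1):
    """Accuracy over games where the model is confident; close calls are skipped."""
    scores = np.asarray(scores, dtype=float)
    home_won = np.asarray(home_won, dtype=bool)
    confident = np.abs(scores - 0.5) > band   # abstain on close calls
    picks = scores[confident] >= 0.5
    return (picks == home_won[confident]).mean()

# Example: three confident picks (two correct) and one abstention.
print(abstain_accuracy([0.8, 0.3, 0.55, 0.9], [True, True, False, True]))
```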