CampusPigskinRank

A machine learning approach for anointing college football's national champion

Motivation

Many people seem to believe that we'll use computerized and human logic to crack the S&P 500 before we consistently slay the dragon of picking college football's most worthy national championship matchup, a problem that loomed largest in the Bowl Championship Series (BCS) era. Perhaps in response, the current College Football Playoff format relies on the consensus of a human panel, with no computer rankings. Contrary to most fans, media members and followers of the sport, I actually favored the now-defunct BCS, the formula created in an attempt to declare a national champion for the Football Bowl Subdivision (née "Division I-A").

I enjoyed the fact that the index used to set up a game between the true #1 and #2 teams in the nation was based on a three-pronged approach: an equal-parts combination of two polls generated by human assessors and one computerized poll.

The latter poll was itself an index of six computer-generated rankings. I tend to lean towards, and be inspired by, Jeff Sagarin's formula and John Hollinger's power rankings more than the others and, to a slightly lesser degree, the RPI. To date, the most progress made toward beating the BCS has been the excellent paper from Microsoft Research, which uses uncertainty and pairwise comparisons of teams and conferences as contributing factors.
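The three-pronged index described above can be sketched in a few lines of Python. This is only an illustration, not the official BCS computation: the function names are hypothetical, the human polls are assumed to arrive as a share of the maximum possible points (0 to 1), and the computer component assumes 25 points for a first-place ranking down to 1 point for 25th, with the best and worst of the six rankings dropped before averaging.

```python
def computer_component(computer_ranks, top_n=25):
    """Collapse six computer rankings for one team into a 0-1 score.

    Assumption: a rank of 1 earns top_n points, a rank of top_n earns 1 point,
    and anything lower (or None for unranked) earns 0. The single lowest and
    single highest point totals are dropped before averaging.
    """
    points = sorted(
        max(top_n + 1 - rank, 0) if rank is not None else 0
        for rank in computer_ranks
    )
    trimmed = points[1:-1]  # drop the lowest and the highest value
    return sum(trimmed) / (top_n * len(trimmed))


def bcs_style_score(human_poll_a, human_poll_b, computer_ranks):
    """Equal-parts average of two human-poll shares and the computer component."""
    return (human_poll_a + human_poll_b + computer_component(computer_ranks)) / 3


# Hypothetical team: 98% and 97% of the possible human-poll points,
# ranked 1st, 1st, 2nd, 1st, 3rd and 1st by the six computers.
example_score = bcs_style_score(0.98, 0.97, [1, 1, 2, 1, 3, 1])
```

Because each prong contributes exactly one third, a team that the computers love but the human voters doubt (or vice versa) cannot reach the top two on one component alone.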

The intent of this repo is to re-rank the teams from each of the BCS's 15 seasons from a purely objective perspective, and to compare the re-ranked teams with how each season actually played out.

(An all-too-brief) Revisionist history

Although the BCS's overall formula was rewritten several times in its 15-year existence to produce a more accurate pairing of the top two teams, and was ultimately replaced by the four-team College Football Playoff, there was still a lot of room for improvement. Accounting for strength of schedule proved to be rather problematic. Books like Bowls, Polls & Tattered Souls and Death to the BCS, both of which are excellent reads, discuss the challenges of composing such an index based on the system that was developed.

A machine learning approach

My hypothesis for applying machine learning is that this is, by definition, a ranking problem. The model will be built as a recommender engine using training data from the BCS seasons. The original PageRank, the algorithm upon which Google Search was built, has been adapted to assign a value to each entity (here, a team) and rank them. Others have attempted to apply Bayesian probability and statistical techniques to estimate each team's rank. A reliable prediction system can produce results about which teams are most worthy to compete for the national championship. The key is in the features: the ambiguities and major components that determine the fitness of each team and conference relative to the others in the FBS.
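To make the PageRank idea concrete, here is a minimal sketch of how it could be adapted to game results. It is not necessarily how this repo does it. It assumes each game is reduced to a (winner, loser) pair and treats every loss as a "link" from the loser to the winner, so that beating a highly ranked team is worth more than beating a weak one. The function name, the damping value of 0.85 and the four-team schedule are all hypothetical.

```python
def rank_teams(games, damping=0.85, iterations=100):
    """Rank teams with a PageRank-style power iteration.

    games: iterable of (winner, loser) pairs.
    Returns a list of (team, score) tuples, best team first.
    """
    games = list(games)
    teams = sorted({team for game in games for team in game})

    # Map each loser to the teams that beat it (its outgoing "links").
    beaten_by = {}
    for winner, loser in games:
        beaten_by.setdefault(loser, []).append(winner)

    n = len(teams)
    score = {team: 1.0 / n for team in teams}
    for _ in range(iterations):
        # Undefeated teams have no outgoing links; spread their score evenly
        # across all teams so the total score stays equal to 1.
        dangling = sum(score[t] for t in teams if t not in beaten_by)
        new_score = {t: (1.0 - damping) / n + damping * dangling / n for t in teams}
        for loser, winners in beaten_by.items():
            share = damping * score[loser] / len(winners)
            for winner in winners:
                new_score[winner] += share
        score = new_score

    return sorted(score.items(), key=lambda item: item[1], reverse=True)


# Hypothetical four-team schedule: A is undefeated, D is winless.
example_games = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("B", "D")]
example_ranking = rank_teams(example_games)
```

This version only knows who won. Richer features, such as margin of victory or game location, would have to enter as weights on the links.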

Dataset

This project uses the public dataset compiled by TheNationalChampionshipIssue. Based on the above assumption about data features, I'd like to evaluate whether techniques like latent Dirichlet allocation and latent semantic indexing can mathematically identify features in the game data that carry more weight than human input might recognize: not just that a team won, but how they won.

However, I'm skeptical from the outset about the ability to generate such features, since datasets from more static domains, like the Netflix and MovieLens film datasets, have more reliable attributes than just a team winning.

Framework

The code is built on top of Google's TensorFlow machine learning framework, and written in Python.


I'm a product manager, author and sportscaster in Guam. I know software, sports, movies, music, marketing, the 80s...and not much else. Find me on Twitter, Facebook and LinkedIn.