The results from this code are not very good. Treat it as an interesting experiment, not as a legitimate source of fantasy rankings. "Data Science" might be BS, but I wouldn't even call this that. Let's call it...data pseudoscience.
This code uses machine learning techniques to attempt to predict fantasy football performance of players based on their performance in past years. Scoring is customizable by league settings, so that predicted points and rankings are tailored to a league's particular desires.
All data was gathered from pro-football-reference.com CSV dumps. This data has the great advantage of being easily available and CSV-formatted, but is not great. In particular, it's missing important stat categories like fumbles and fumble recoveries (and less significantly, 2-point conversions). It's also useless for leagues with point bonuses for big games or big plays (eg, bonuses at 100/200/etc. yards, or 40+ yard plays), because the stats are aggregated over an entire season.
Parsing routines for this data are implemented in parsing.py.
Merely getting data out of the files is pretty easy: aside from headers that recur and are split over two lines (which requires the minimally stateful parser in parser._parse_file), and some extra sigils around players with interesting properties, the data come out cleanly. There are instances of missing data, represented as empty strings.
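Here's a sketch of what such a minimally stateful loop can look like; the header-detection rule, column names ("Rk", "Player"), and sigils below are assumptions for illustration, not the real parser._parse_file:

```python
import csv

def parse_file(path):
    """Sketch of a minimally stateful CSV parser.

    Assumptions (not taken from the real code): a header precedes the data,
    recurs periodically, and is split over two physical lines (a group row
    followed by a column row); player names may carry trailing sigils
    like '*' or '+'.
    """
    rows = []
    header = None
    pending = None  # first half of a two-line header, if we've just seen one
    with open(path, newline="") as f:
        for fields in csv.reader(f):
            if not any(fields):
                continue  # skip blank lines
            if pending is not None:
                # Second header line: merge the two halves column-by-column.
                header = [(a + " " + b).strip() for a, b in zip(pending, fields)]
                pending = None
            elif fields[0] in ("", "Rk"):
                # (Re)start of a two-line header block.
                pending = fields
            else:
                row = dict(zip(header, fields))
                # Strip the sigils marking "interesting" players.
                row["Player"] = row.get("Player", "").rstrip("*+")
                rows.append(row)
    return rows
```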
The interesting work in the parser is that required to construct a unified data set from multiple years' worth of files. In particular, the CSV dumps do NOT contain a unique ID for each distinct player. To do meaningful learning, we must be able to track which player is which from year to year, even in the face of team changes, new players coming in with the same name, and in principle name changes as well. (Thankfully, the data set I used did not make me unify Chad Johnson and Chad Ochocinco.)
This work is all done in parser._assign_ids. The ID assignment goes through the parsed data in chronological order. If a player with a particular name has never been seen before, it assumes this is a new unique player, and assigns a new ID to this player (tagging it with position, team, and last year seen playing). Things get more interesting once we start seeing the same names show up.
To explain the logic, let's say we see a player named N, playing position P, on team T, in year Y. If we saw a player N playing P on T in year Y-1, we assume this is the same player. This is the common case of a player staying on one team from year to year.
Occasionally, you'll see two players both named N, but playing for different teams in the same year Y (for example, consider the Adrian Petersons of the Vikings and of the Bears, or the Steve Smiths of the Giants and the Panthers). These will be considered separate players.
If there was exactly one player sharing N and P in a previous year, but on a different team the previous year, then that player was probably traded, so we assign him the ID from that previous player record. Once in a blue moon, you'll even see a player who keeps the same team, but appears to change position (in the database, this happens to Steve Slaton changing from RB in 2008 to WR in 2009); this will also be marked as the same player.
Those two clauses have a nasty interaction, which is that you may have two players with the same name who "switch" both team and position: for example, consider Alex Smith (TE, TB) and Alex Smith (QB, SF). These cases are specially marked to split the two players.
Finally, because the ID assignment assigns IDs sequentially as it goes through the parsed file, rather than simultaneously, there are some trades which may not be possible to disambiguate except manually. The one case in which this appeared was for Zach Miller (TE), traded from OAK to SEA between 2010 and 2011. There was also a Zach Miller (TE) playing for JAC in 2010. If the entire file were parsed, it would be seen that the latter Miller stayed at Jacksonville, so the likelier parse would be that the Oakland Miller moved to Seattle. In principle, though, nothing rules out a JAC->SEA and OAK->JAC trade sequence, so this case is explicitly coded (via a SPECIAL_CASE_TRADES dict in constants.py).
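Sketched as code, the heuristic looks roughly like this; the record layout, field names, and the shape of SPECIAL_CASE_TRADES here are assumptions for illustration, not the real parser._assign_ids:

```python
import itertools

_new_ids = itertools.count()

def match_player(records, name, pos, team, year, special_case_trades):
    """Return an ID for (name, pos, team) seen in `year`, creating one if needed.

    `records` maps id -> {"name", "pos", "team", "last_year"} for players seen
    in earlier years; the caller is assumed to update team/position/last_year
    after a match.  Sketch of the heuristic only.
    """
    # Hand-coded resolutions for genuinely ambiguous trades.
    if (name, team, year) in special_case_trades:
        return special_case_trades[(name, team, year)]

    # Only consider players seen in *earlier* years: two same-name players
    # active in the same year stay distinct.
    candidates = [pid for pid, r in records.items()
                  if r["name"] == name and r["last_year"] < year]

    # 1. Same name, position, and team as a previous year: the common case.
    same_both = [pid for pid in candidates
                 if records[pid]["pos"] == pos and records[pid]["team"] == team]
    if same_both:
        return same_both[0]

    # 2. Exactly one earlier player with this name and position, on a
    #    different team: assume a trade.
    same_pos = [pid for pid in candidates if records[pid]["pos"] == pos]
    if len(same_pos) == 1:
        return same_pos[0]

    # 3. Same name and team, different position: assume a position change
    #    (the Steve Slaton RB -> WR case).
    same_team = [pid for pid in candidates if records[pid]["team"] == team]
    if len(same_team) == 1:
        return same_team[0]

    # 4. Otherwise, including namesakes who appear to "switch" both team and
    #    position, treat this as a new player.
    pid = next(_new_ids)
    records[pid] = {"name": name, "pos": pos, "team": team, "last_year": year}
    return pid
```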
Code in prediction.py and evaluation.py.
Each player is described by a single feature vector (but notice that this later gets rotated and split in the learning). All features are numeric, and there are two classes of features:
- "Fixed" features only get one copy in the feature vector, containing the value corresponding to the most recent year of data. These are currently age (in years) and position (four dimensions corresponding to one-hot encoding of QB, WR, RB, or TE).
- "Tracked" features recur once per year that we have data (eg, if data corresponding to five years is passed into the file, then there will be five copies of each tracked feature, one per year loaded). For players that don't show up in every year (particularly younger players or players who retire), there will be missing data, represented as absence from the dictionary.
Both types of features are computed by arbitrary Python functions of the data parsed out of the data file; the tracked features used consist of the stats used to compute fantasy scores, as well as the fantasy score itself.
Assume that we're in 2013, and so have data files for 2012 and 2011. Each tracked feature is replicated twice, with a tag year_delta corresponding to how many years back the data is from the present. For example, "Yards passing, delta=1" would be passing yards in 2012. This means that a player who last played in 2011 would only have entries in "Yards passing, delta=2"; a 2012 rookie would only have delta=1, and a veteran might have both.
prediction.featurize_player takes a stats dictionary for a player (as built by the parser) and builds a feature dictionary for that player.
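Sketched under an assumed layout for the parsed player record (the stat names and dictionary shape below are illustrative, not the real prediction.featurize_player):

```python
POSITIONS = ("QB", "RB", "WR", "TE")
# Stats assumed tracked: the inputs to the fantasy score plus the score itself.
TRACKED_STATS = ("PassYd", "PassTD", "RushYd", "RushTD", "RecYd", "RecTD", "FantasyPoints")

def featurize_player(player, current_year):
    """Build a {feature: value} dict for one player.

    Assumed input shape:
        {"age": 30, "pos": "QB",
         "seasons": {2011: {"PassYd": 1200, ...}, 2010: {...}}}
    """
    features = {}
    # Fixed features: one copy each, taken from the most recent year of data.
    features["age"] = float(player["age"])
    for pos in POSITIONS:
        features["is_" + pos] = 1.0 if player["pos"] == pos else 0.0
    # Tracked features: one copy per year, keyed by (stat, year_delta).
    for year, stats in player["seasons"].items():
        delta = current_year - year
        for stat in TRACKED_STATS:
            if stat in stats:  # missing stats are simply absent
                features[(stat, delta)] = float(stats[stat])
    return features
```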
The objective of the model is to predict the number of fantasy points scored by each player this year, by leveraging all past data we have on that player. To train and validate the model, we would like to predict the number of fantasy points scored in previous years. However, it is not sensible to use, say, 2012 data to try to predict 2011.
prediction.split_player takes a feature dictionary for a player and emits a new feature dictionary for each year that that player played. To understand its logic, it's easiest to consider an example. Let's say we have a 30-year-old QB who played in 2011 and 2010, and we have data for 2012, 2011, and 2010. Let's only track age, position, and passing yards (PassYd for short). This player might have the following features coming from the initial featurization pass:
- Age: 30
- isQB: 1; all other positions 0
- PassYd, 1: does not exist (did not play in 2012)
- PassYd, 2: 1200 [2011 yards]
- PassYd, 3: 1500 [2010 yards]
The splitting code copies all fixed features to each new row (with a special-case correction for age). A new row would then be emitted for each year:
- Age 30; QB; (PassYd, 0): 1200; (PassYd, 1): 1500; (PassYd, 2-3): missing
- Age 29; QB; (PassYd, 0): 1500; (PassYd, 1-3): missing
This data can then be used to build a uniform predictor that jointly trains over all years: we can just build a model to predict on the target property at year delta 0. The given example is obviously trivial, since only one instance actually has much data beyond the objective (PassYd, 0), but the real system would shift and replicate all the tracked stats.
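The delta-shifting can be sketched as follows, reusing the assumed feature layout from the previous sketch (not the real prediction.split_player):

```python
def split_player(features):
    """Emit one feature dict per year the player actually played.

    For the copy centred d years back, each tracked feature (stat, d + k)
    is re-keyed to (stat, k), so the target season always sits at delta 0;
    age is shifted back by the same amount.
    """
    fixed = {k: v for k, v in features.items() if not isinstance(k, tuple)}
    tracked = {k: v for k, v in features.items() if isinstance(k, tuple)}
    played_deltas = sorted({delta for (_stat, delta) in tracked})

    rows = []
    most_recent = played_deltas[0] if played_deltas else 0
    for d in played_deltas:
        row = dict(fixed)
        # Age was recorded for the most recent year played, so shift it back.
        row["age"] = fixed["age"] - (d - most_recent)
        for (stat, delta), value in tracked.items():
            if delta >= d:
                row[(stat, delta - d)] = value
        rows.append(row)
    return rows
```

Run on the example above, this reproduces exactly the two rows listed: one at age 30 with (PassYd, 0) = 1200 and (PassYd, 1) = 1500, and one at age 29 with only (PassYd, 0) = 1500.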
Once the splitting is complete, the split feature dictionaries are easily turned into a feature matrix in prediction.construct_feature_matrix. Columns are created for the union of all features present in the given instances; missing entries are encoded as NaN to be resolved later in the pipeline.
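A minimal sketch of that step (the real prediction.construct_feature_matrix may organize its bookkeeping differently):

```python
import numpy as np

def construct_feature_matrix(feature_dicts):
    """Turn a list of {feature: value} dicts into an (n_players, n_features) array.

    Columns are the union of all features seen across instances; entries a
    player lacks are left as NaN for the imputer to fill in later.
    """
    columns = sorted({key for d in feature_dicts for key in d}, key=str)
    col_index = {col: j for j, col in enumerate(columns)}
    matrix = np.full((len(feature_dicts), len(columns)), np.nan)
    for i, d in enumerate(feature_dicts):
        for key, value in d.items():
            matrix[i, col_index[key]] = value
    return matrix, columns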
Given the matrix form of features, learning is a straightforward regression problem. The learning pipeline, implemented separately in both prediction.cross_validate and prediction.predict_current_year (DRY violation, but I was running up against the draft deadline!), first fills in missing features by mean-value imputation. It then optionally does zero-mean, unit-variance standardization. This step is necessary for some learning algorithms (e.g., support vector regression), but complicates interpretation of the output data unless you undo the scaling. It turns out SVR doesn't actually work that well, it's annoying to back out the shift and scale for fantasy points, and the other algorithms mostly don't care, so the main code does not use standardization.
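With a current scikit-learn, those steps amount to something like the following sketch; the project wires them up by hand rather than through a Pipeline object, and the default Ridge model here is just a placeholder:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def make_pipeline(model=None, standardize=False):
    """Mean imputation -> optional standardization -> regression."""
    steps = [("impute", SimpleImputer(strategy="mean"))]
    if standardize:
        steps.append(("scale", StandardScaler()))  # off by default, per the text
    steps.append(("regress", model if model is not None else Ridge()))
    return Pipeline(steps)
```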
prediction.cross_validate uses k-fold cross-validation to test the performance of the model(s). The input data is split into k folds; on each fold we fit the imputer, standardizer, and model only on the training data and apply them to both training and test data. The regressed scores for each fold are accumulated until, at the end of the cross-validation, we have test predictions for every player. The unified lists of true and predicted scores are then split into lists by player position by prediction.position_ranking_lists.
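The core of that loop, sketched with scikit-learn's KFold (not the project's exact code):

```python
import numpy as np
from sklearn.model_selection import KFold

def out_of_fold_predictions(pipeline, X, y, n_folds=10):
    """Accumulate test-fold predictions so that every player's predicted score
    comes from a model that never saw that player during fitting."""
    predictions = np.empty(len(y), dtype=float)
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True).split(X):
        # Imputer, (optional) scaler, and model are all fit on training data only.
        pipeline.fit(X[train_idx], y[train_idx])
        predictions[test_idx] = pipeline.predict(X[test_idx])
    return predictions
```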
We're more interested in the relative ordering of players than in the absolute point totals. A learning-to-rank method might be a better fit for this problem, but regression is an easy substitute, and seeing predicted fantasy points is ultimately useful as well.
I evaluate the quality of the model by computing Kendall's tau between the true and predicted scores for each position in each year (eg, 2008 QB true and predicted ordering). Kendall's tau coefficient is a nonparametric measure of rank correlation between two variables. If the rankings produced by the two agree exactly, tau=1; if they disagree perfectly (one is the reverse of the other), tau=-1; if they are independent, tau=0. Thus, we prefer models with cross-validated tau near 1. If the tau is near 0, then the model is learning nothing.
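For concreteness, here's the behavior on three toy rankings, using scipy.stats.kendalltau for the illustration:

```python
from scipy.stats import kendalltau

tau, _ = kendalltau([1, 2, 3, 4], [10, 20, 30, 40])   # identical ordering: tau = 1.0
tau, _ = kendalltau([1, 2, 3, 4], [40, 30, 20, 10])   # reversed ordering:  tau = -1.0
tau, _ = kendalltau([1, 2, 3, 4], [20, 40, 10, 30])   # unrelated ordering: tau = 0.0
```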
prediction.compute_taus evaluates the tau scores given the ranking lists for each year. (NB: it's not actually each year -- it's the last year a given player played. So, for example, players who played in 2012 may appear in the same bin as retired players from earlier. This should be OK, since we'll have the true fantasy points for the same set of players.)
Since there are a lot of players who are completely irrelevant for fantasy purposes, only the ranking of the high-ranked players is of interest. (We would like the model to distinguish Drew Brees from Mark Sanchez, but we really don't care about 4th-string quarterbacks.) Thus, tau is actually computed on a subset of the data set: only the top N (configurable in constants.TOP_N) players in each year at each position. We thus report two numbers for each year/position/model triple: the fraction of players in the true top N who ranked in the predicted top N, and the tau coefficient for this intersection list.
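A sketch of that evaluation (the real computation lives in prediction.compute_taus with N coming from constants.TOP_N; the function below is illustrative):

```python
from scipy.stats import kendalltau

def top_n_evaluation(names, true_scores, predicted_scores, top_n=50):
    """Fraction of the true top-N recovered in the predicted top-N, plus
    Kendall's tau over the players appearing in both lists."""
    rows = list(zip(names, true_scores, predicted_scores))
    true_top = sorted(rows, key=lambda r: r[1], reverse=True)[:top_n]
    pred_top_names = {name for name, _, _ in
                      sorted(rows, key=lambda r: r[2], reverse=True)[:top_n]}

    overlap = [(t, p) for name, t, p in true_top if name in pred_top_names]
    fraction = len(overlap) / float(top_n)
    tau, _ = kendalltau([t for t, _ in overlap], [p for _, p in overlap])
    return fraction, tau
```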
main.main is a driver script that loads the data, featurizes it, displays cross-validation results from a number of models, and then dumps predicted scores and rankings for the current year. I tried three major classes of regression algorithms for this problem:
- Generalized linear models
    - Run-of-the-mill linear regression
    - L2-regularized "ridge" regression
- Ensemble methods
    - Random forest regression
    - Extremely randomized trees regression
    - Gradient boosted regression trees
    - AdaBoost.R2 regression
- Support vector regression
    - C-regularized, RBF kernel
    - nu-regularized, RBF kernel
There was no particular basis for selecting these models. I don't particularly believe that a pure linear model would work, but it's the easiest possible model to try. Ensemble methods are not particularly interpretable, but they have good practical and theoretical performance. SVMs are theoretically beautiful and seemed like a decent shot at a good model. Mostly I was guided by what was conveniently available off the shelf in scikit-learn! I made no effort at hyperparameter optimization. While tuning is known to have an effect on some models (eg, SVMs), the performance of all methods was sufficiently similar that I didn't expect any magical gains from it.
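In scikit-learn terms, the model zoo is roughly the following; constructors are shown with defaults, and the exact arguments used in main.main are not reproduced here:

```python
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR, NuSVR

MODELS = {
    "Linear regression": LinearRegression(),
    "Ridge regression": Ridge(),
    "Random forests": RandomForestRegressor(),
    "Extremely randomized trees": ExtraTreesRegressor(),
    "GBRT": GradientBoostingRegressor(),
    # scikit-learn's AdaBoostRegressor implements the AdaBoost.R2 algorithm.
    "AdaBoost.R2": AdaBoostRegressor(),
    "C-SVR (RBF)": SVR(kernel="rbf"),
    "nu-SVR (RBF)": NuSVR(kernel="rbf"),
}
```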
Randomization in cross-validation and ensemble methods means that particular results vary from run to run, but the overall sketch of performance remains fairly similar. Let's take a look at Kendall's tau and the fraction of the top 50 found in the top 50 predicted results for each position, only in the most recent year for each player (ie, the entries with the largest amount of data). Note that the cross-validation fold splits should be the same between models, and I used 10-fold CV.
Algorithm | QB | RB | TE | WR |
---|---|---|---|---|
Linear regression | 0.505, 76% | 0.379, 54% | 0.301, 70% | 0.273, 68% |
Ridge regression | 0.549, 78% | 0.360, 56% | 0.308, 70% | 0.291, 68% |
Random forests | 0.533, 72% | 0.191, 60% | 0.287, 70% | 0.144, 60% |
Extremely randomized trees | 0.544, 74% | 0.266, 58% | 0.152, 68% | 0.166, 62% |
AdaBoost.R2 | 0.511, 74% | 0.455, 62% | 0.285, 56% | 0.255, 56% |
GBRT | 0.531, 76% | 0.360, 56% | 0.374, 74% | 0.217, 62% |
C-SVR | 0.319, 68% | 0.222, 56% | 0.251, 72% | 0.148, 66% |
nu-SVR | 0.301, 68% | 0.276, 58% | 0.244, 72% | 0.163, 66% |
There are a handful of takeaways from this table:
- SVMs perform noticeably worse than any other method. This might be down to the lack of hyperparameter tuning (eg, C/nu fitting).
- Regularization on linear regression doesn't make a huge difference. Ridge (L2 regularized) linear regression performs about the same as the unregularized standard algorithm. The feature vectors here are not very high-dimensional, though there is some linear dependence (eg, fantasy points are a linear combination of other terms).
- Boosting methods performed noticeably better than the random-forest methods. AdaBoost.R2 and GBRTs both had far better performance ranking RBs and WRs (the positions with more active players than others) than did random forests or extremely randomized trees.
The main takeaway, though, is that despite small differences in performance among methods, nothing actually works very well. The best tau scores we see are on the order of 0.5 for QBs, down to 0.3 for WRs. While we can sort of predict who the top 50 players at each position will be (getting maybe 35 out of 50 right), the rankings within those 50 are only a little better than random. Crucially, they're probably not much better than just looking at ESPN or Yahoo rankings (or just looking at some names, if you pay attention to football at all).
This was a fun exercise, but not a terribly useful project in terms of actual performance. It's an interesting demonstration of applying non-time-series methods to predict time-series data for a practical problem.
An interesting enhancement to the model would be to consider team membership as well. A very simple version would just add team as a feature; this only captures "dynasties" where some teams are better than others, period. A more refined version would consider who else a player is playing with. For example, if a quarterback's receivers all get traded away, it is reasonable to guess that his passing performance will drop off. However, this requires modeling cross-player interactions, which is not easily integrated into this model (running it as a second stage of prediction might work). Injury data and the quality of a team's defense may also be interesting sources of information.