analytics-handbook: A Jupyter Notebook repository from alucab

Soccer Analytics Handbook

Devin Pleuler — April 2020

This is probably overdue. I believe that people who have managed to wiggle themselves into dream jobs have a responsibility to help others get there too. This was written during the depths of social isolation imposed by the COVID-19 pandemic. During this period, I've had an atypically large number of students and career changers reach out to me with questions, and a little extra free time, so I'm finally completing my assigned homework.

There are plenty of resources that cover the "how do I get a job in sports analytics" career-strategy questions, like THIS and THIS from Sam Gregory. This handbook is geared more toward the technical skills, concepts, and sports analytics history that I think are worth familiarizing yourself with.

In the handbook you can find three primary things:

Resources and suggestions for technical skills worth having for work in soccer analytics (these can probably be extended to other sports).
A series of tutorials delivered in Jupyter notebook format using StatsBomb Open Data, covering various data science techniques common in soccer analytics.
Collected research and articles that I believe are required reading to get up to speed with both the history and state-of-the-art in soccer analytics.

Live long and prosper. 🖖🏻

Where to start?

The most important attributes for contributing to the soccer analytics landscape are a deep knowledge of the game, an ability to communicate clearly and effectively, and a bucket load of skepticism. Unfortunately, getting a job in soccer analytics is largely independent of these attributes and mostly depends on good fortune and timing. It is not a meritocracy, and I hope that changes.

But the most important technical skill once you have landed a job in soccer analytics is experience with scripting languages, preferably Python or R as they're great for data science.

I personally prefer Python, and therefore my recommendations will be geared in that direction. My primary reasons for this suggestion are:

Simple syntax makes it great for first-time programmers
Excellent documentation and community support
Most analytics departments are using it
Plays nicely with others
It's magic

There are a ton of great resources online for learning Python, so I'm not going to reinvent the wheel here. Here are some that look good:

The Hitchhiker's Guide to Python
The Python Tutorial direct from the official source: docs.python.org.
Plenty of others

Note: Starting with Python 3 (I'd suggest version 3.7+) is probably the best route at this point. Python 2 is in the painful process of being put out to pasture.

Important Python Libraries

From a data science perspective, you can do just about anything worthwhile with the SciPy Stack. All of its libraries are well-supported, and easily google-able if you run into issues. As a beginner, I wouldn't stray too far from these foundational libraries. If you do, you should have a decent reason for it. Some of its important components include:

NumPy - A fundamental library for scientific computing in Python. Particularly great for optimized vector and matrix calculations.
Pandas - A fast data analysis and manipulation library. Its DataFrame functionality is super useful (and reminiscent of some good bits of R).
Matplotlib - The de facto Python plotting library. It's finicky, but powerful. I've learned to love it.

I'd also suggest scikit-learn (a.k.a. sklearn), which I find very user-friendly and is built on top of the libraries mentioned above. In our tutorials, we will predominantly use the sklearn implementations.

Where can I get data?

First, it's worth explaining what varieties of soccer performance data exist in the wild. Typically, and colloquially, there are two types of data: Event Data and Tracking Data.

November 2020 Addition: It's probably now worth including Broadcast Tracking as a standalone category.

Event Data is effectively a chronological, event-by-event tabulation of on-ball actions. It's typically collected from broadcast footage by third-party collectors and sold on the open market to clubs, broadcasters, the gambling industry, and even private individuals. The primary companies competing in this space are Opta (now owned by STATS Perform) and StatsBomb, but there are other competitors.
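To make that concrete, here is a minimal sketch of what a few rows of event data might look like once loaded. The field names loosely follow the nested JSON layout used in StatsBomb's open data, but the sample values are invented for illustration, and `flatten_event` is a hypothetical helper rather than part of any library.

```python
import json

# Invented sample in a StatsBomb-like nested layout (all values are made up).
RAW_EVENTS = """
[
  {"index": 1, "minute": 0, "second": 4, "type": {"name": "Pass"},
   "team": {"name": "Home"}, "player": {"name": "Player A"},
   "location": [60.0, 40.0]},
  {"index": 2, "minute": 0, "second": 7, "type": {"name": "Carry"},
   "team": {"name": "Home"}, "player": {"name": "Player B"},
   "location": [72.5, 33.0]},
  {"index": 3, "minute": 0, "second": 11, "type": {"name": "Shot"},
   "team": {"name": "Home"}, "player": {"name": "Player B"},
   "location": [104.0, 38.5]}
]
"""


def flatten_event(event):
    """Turn one nested event into a flat row (hypothetical helper)."""
    x, y = event.get("location", [None, None])
    return {
        "index": event["index"],
        "seconds": event["minute"] * 60 + event["second"],
        "type": event["type"]["name"],
        "team": event["team"]["name"],
        "player": event.get("player", {}).get("name"),
        "x": x,
        "y": y,
    }


events = [flatten_event(e) for e in json.loads(RAW_EVENTS)]

# The rows are already chronological, so filtering keeps them in match order.
shots = [row for row in events if row["type"] == "Shot"]
print(shots)
```

A list of flat dictionaries like this can be passed directly to `pandas.DataFrame(events)` if you want to continue in Pandas.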

Tracking Data is an entirely different beast. Player tracking systems record the coordinate position of every player on the field (and usually the ball), many times per second. State-of-the-art systems collect up to 25 samples per second. Because these systems are expensive to install and operate, and require in-stadium hardware, this data is mostly available to the clubs themselves, but academics frequently get their hands on this data in a highly anonymized format through tediously painful research agreements. There are various competitors in this space, such as ChyronHego, Second Spectrum, STATS Perform, Metrica, Signality, and others.

The difference in scale between the two data types is enormous. A single game of Event Data features roughly 2-3 thousand individual events. A single game of Tracking Data represents 2+ million individual measurements.
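A rough back-of-the-envelope check shows where a figure like that comes from. The sketch below assumes a 90-minute match with no stoppage time, 22 players plus the ball, and counts one position sample per tracked object as one measurement. Those assumptions are mine; real providers differ in sampling rate and in what they count.

```python
SAMPLES_PER_SECOND = 25      # upper end of the sampling rates quoted above
MATCH_SECONDS = 90 * 60      # assumption: no stoppage time
TRACKED_OBJECTS = 22 + 1     # assumption: 22 players plus the ball

frames = SAMPLES_PER_SECOND * MATCH_SECONDS
measurements = frames * TRACKED_OBJECTS

print(f"{frames:,} frames, {measurements:,} position samples")

# Compare against the upper estimate of roughly 3,000 events per match.
ratio = measurements // 3_000
print(f"roughly {ratio:,}x more rows than event data")
```

Under these assumptions the total comes to about 3.1 million position samples per match, which is consistent with the "2+ million" figure and about a thousand times larger than an event data file.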

Broadcast Tracking is a new variety of data that has rapidly grown in popularity over the last couple of years. As the state of the art in computer vision has progressed, collecting high-resolution tracking data from broadcast video has become a tractable problem. What is being collected is obviously not a complete data set, but the most important and relevant areas are captured. The leaders in this space appear to be SkillCorner and Sportlogiq.

The introduction of Broadcast Tracking is particularly interesting for the player recruitment theater. Since access to full-tracking data is typically limited to teams in a single league, it provides scouting departments a more complete picture of players in leagues that their team does not belong to.

StatsBomb has provided a large volume of data "freely available for public use" via their Open Data repository on GitHub in order to better serve the analytics community. We will be using this data in some of the tutorials below.

Metrica has released two matches of tracking data, which are the first examples of publicly available tracking data to my knowledge. This is a huge contribution to the soccer analytics community, and I plan on contributing some examples of how to best use tracking data.

SkillCorner has released 9 matches of broadcast tracking data as open source.

Last Row (Ricardo Tavares) has provided some tracking-like data for educational purposes on the Friends of Tracking GitHub.

Jupyter Notebooks!

How could I go so long without mentioning Jupyter? That's because it deserves its own section. I discovered Jupyter only a year or two ago, and I've become a much stronger analyst because of it.

Jupyter Notebooks are easily sharable documents that contain executable Python code alongside human-readable text for annotative purposes. They're perfect for sharing code and demonstrating concepts. We will be using these to deliver the tutorials below.

The notebooks will be hosted on Google Colab, which allows you to write, run, and share Jupyter notebooks within your Google Drive. For free!

If you're unfamiliar with Google Colab (or Jupyter), check out this introduction video.

Soccer Analytics Tutorials

1. Data Extraction & Transformation

Parsing raw StatsBomb data and storing it in a Pandas DataFrame

2. Linear Regression

Examining the relationship between a player's pass volume and completion percentage

3. Logistic Regression

Predicting the outcome of a shot given its features

4. Clustering

Identifying different types of passes

5. Database Population & Querying

Using Pandas & SQLAlchemy to store and retrieve StatsBomb event data

7. Data Visualization

Create a Passing Network from the 2018 Men's World Cup Final

8. Non-Negative Matrix Factorization

Using NNMF to uncover spatial components of individual player contribution.

9. Pitch Dominance

Loading and displaying the Metrica sample tracking data, and building a basic pitch dominance model.

10. Convolutional Neural Networks

Building pass difficulty surfaces using a convolutional neural network.
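As a taste of what these tutorials involve, the passing network mentioned above starts from nothing more than counting completed passes between each pair of players. The sketch below uses only the standard library and invented pass records. The player labels and the `passes` list are hypothetical, and the actual notebook may build its network differently.

```python
from collections import Counter

# Invented (passer, recipient) pairs for one team's completed passes.
passes = [
    ("A", "B"), ("B", "C"), ("A", "B"), ("C", "A"),
    ("B", "A"), ("B", "C"), ("A", "C"), ("C", "B"),
]

# Edge weight: passes exchanged between two players, ignoring direction.
edges = Counter(tuple(sorted(pair)) for pair in passes)

# Node weight: total number of passes each player was involved in.
involvement = Counter()
for passer, recipient in passes:
    involvement[passer] += 1
    involvement[recipient] += 1

for (p1, p2), count in edges.most_common():
    print(f"{p1} <-> {p2}: {count} passes")
```

In a plot, the edge counts would typically set line width and the involvement counts would set node size, with each node drawn at the player's average location.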

What other skills are worth picking up?

After the programming side, my suggestion is to gain some experience with relational databases. In particular, I think MySQL or PostgreSQL are great places to start. Like the rest of my recommendations, they're both open source. I mention that here because you can find a ton of enterprise solutions in this area.

Understanding SQL, which has various dialects (but you really only need to know one to adequately Google the quirks between them), is important for efficiently fetching data before processing it. At some clubs, for a hire that is coming into an already functioning analytics department, this skill is possibly the most important.
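As a small illustration of fetching only what you need before processing it, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. SQLite is only a stand-in here for MySQL or PostgreSQL, and the `events` table, its columns, and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, nothing on disk
conn.execute(
    "CREATE TABLE events (match_id INTEGER, player TEXT, type TEXT, outcome TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (1, "Player A", "Pass", "Complete"),
        (1, "Player A", "Pass", "Incomplete"),
        (1, "Player A", "Shot", "Saved"),
        (1, "Player B", "Pass", "Complete"),
        (1, "Player B", "Pass", "Complete"),
    ],
)

# Let the database do the filtering and aggregation instead of Python.
query = """
    SELECT player,
           COUNT(*) AS passes,
           SUM(CASE WHEN outcome = 'Complete' THEN 1 ELSE 0 END) AS completed
    FROM events
    WHERE type = 'Pass'
    GROUP BY player
    ORDER BY player
"""
rows = conn.execute(query).fetchall()
conn.close()

for player, passes, completed in rows:
    print(player, passes, completed)
```

The query sticks to standard `CASE WHEN` aggregation on purpose, since shortcuts for conditional counting are one of the places where SQL dialects differ.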

I use a lot of SQLAlchemy, which has a bit of a learning curve, but which I've found tremendously useful for bridging the gap between Python and SQL. And it's super cross-platform.

Don't forget Excel. It's possibly the most important piece of software ever built. Nobody is too good for Excel.

Having some data-visualization experience in your toolkit is also valuable. After Matplotlib, I would recommend:

D3.js is highly recommended for those with even a bit of JavaScript and web development experience. The learning curve is totally worth it.
Altair for those making the transition over from R who really miss ggplot, one of R's most redeeming qualities.
Seaborn is a nice visualization library built on top of Matplotlib.
Tableau is totally fine. The tradeoff between customizability and ease of use is worth it in plenty of situations. Don't be a hero.
Don't forget how powerful conditional formatting is in Excel.

Knowing some basic version control is really important for working effectively on a data science or analytics team. Git (and GitHub) is the easy recommendation here. Also, code testing is a thing, unbeknownst to a majority of my code. I'd suggest using nose.
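For a sense of what a code test looks like, here is a minimal sketch that uses the standard library's `unittest` module rather than nose, purely so that it runs without installing anything. The `completion_pct` function is a hypothetical helper invented for the example.

```python
import unittest


def completion_pct(completed, attempted):
    """Pass completion percentage; 0.0 when nothing was attempted."""
    if attempted == 0:
        return 0.0
    return 100.0 * completed / attempted


class CompletionPctTest(unittest.TestCase):
    def test_typical_case(self):
        self.assertAlmostEqual(completion_pct(45, 50), 90.0)

    def test_no_attempts(self):
        # Guards against a ZeroDivisionError for players with no passes.
        self.assertEqual(completion_pct(0, 0), 0.0)


# Run the tests explicitly so this also works inside a notebook cell.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CompletionPctTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Test runners such as nose discover and run `unittest.TestCase` classes like this one automatically, so the same file works either way.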

It's probably worth adding a note about IDEs (integrated development environments) in here for the sake of completeness (i.e. what you write your code in). I've raved about Jupyter notebooks above, but they aren't great for larger software projects.

Personally, I enjoy using Atom (made by GitHub) because I'm apparently a glutton for punishment. A lot of people swear by PyCharm, and others love VS Code. They're all fine. It's also smart to get familiar with vim or emacs, and general bash commands. Survival skills in the command-line environment are important when you start getting into data engineering stuff.

When you eventually reach a place where you might want to put some of your analytics stuff online, but don't want to leave Python, I'd suggest using one of these web frameworks:

Flask is an awesome lightweight framework that lets you prototype stuff easily and quickly. Great for building APIs.
Django is a fully-featured framework that is a bit harder to use, but does a lot of the hard stuff for you. Its ORM is quite similar to SQLAlchemy, which is a plus.

And for the more-experienced analytics enthusiasts, I'd suggest picking up some of these:

Apache Spark (and Databricks) for massive code parallelization across clusters.
Numba for high performance Python.
TensorFlow and/or Keras for deep learning (also PyTorch).

There are lots of different ways to install both Python and all these different packages. The easiest way to get up and running on your local machine is probably Anaconda. I also suggest learning how to use pip and virtual environments.

Soccer Analytics Research:

A Framework for Tactical Analysis and ... by Sarah Rudd
An Extension of the Pythagorean Expectation ... by Howard Hamilton
Large-Scale Analysis of Soccer Matches ... by Alina Bialkowski et al.
Spatio-Temporal Analysis of Team Sports – A Survey by Joachim Gudmundsson and Michael Horton
Physics-Based Modeling of Pass Probabilities in Soccer by Will Spearman et al.
Data-Driven Ghosting using Deep Imitation Learning by Hoang M. Le, Peter Carr, Yisong Yue, and Patrick Lucey
Beyond Expected Goals by Spearman
Not All Passes Are Created Equal: ... by Paul Power et al.
Wide Open Spaces: ... by Javier Fernandez and Luke Bornn
Decomposing the Immeasurable Sport: ... by Fernandez, Bornn, and Dan Cervone
Modelling the Collective Movement of Football Players by Francisco José Peralta Alguacil
Player Vectors: Characterizing Soccer Players’ Playing Style ... by Tom Decroos and Jesse Davis
Actions Speak Louder than Goals: ... by Tom Decroos, Lotte Bransen, Jan Van Haaren, and Jesse Davis

They've created a Python library from this research. Find it in the Resources section below.

Dynamic Analysis of Team Strategy in Professional Football by Laurie Shaw and Mark Glickman
Ready Player Run: Off-ball run identification and classification by Sam Gregory
SoccerMap: A Deep Learning Architecture for ... by Javier Fernandez and Luke Bornn
A new look into Off-ball Scoring Opportunity: ... by Hugo M. R. Rios-Neto, Wagner Meira Jr., Pedro O. S. Vaz-de-Melo

Some Favorite Blog Posts

Assessing The Performance of Premier League Goalscorers by Sam Green
Counting Across Borders by Ben Torvaney
Defending Your Patch by Thom Lawrence
Pass Footedness in the Premier League by James Yorke
Messi Walks Better Than Most Players Run by Bobby Gardiner
Game of Throw-Ins by Eliot McKinley
Expected Threat by Karun Singh
Passing Out at the Back by Will Gürpinar-Morgan
The 10 Commandments of Football Analytics by Tom Worville
Breaking Down Set Pieces ... by Euan Dewar
Data Based Coaching: ... by Kieran Doyle
Coaches Reward Goalscorers ... by McKinley and John Muller

Many of these are borrowed from Sam Gregory's list here. This is far from complete, and I will definitely add to it from time to time.

Recommended Watching:

Self-Supervised Representations for Tracking Data

This 2020 OptaPro Forum talk from Karun Singh represents some state-of-the-art research around autoencoders and feature extraction from tactical context.

An American Analyst in London

Fun conversation at SSAC 2019 between StatsBomb CEO Ted Knutson, Houston Rockets GM Daryl Morey, and some other guy.

Beyond the Baseline: ...

This classic 2018 OptaPro Forum talk from the effervescent Marek Kwiatkowski is one of my favorites. Suggests a mixed model approach for personalizing certain soccer metrics.

Some Things Aren't Shots

Great talk from Thom Lawrence at the 2019 StatsBomb Innovation Conference covering approaches to Expected Possession Value.

Beyond Save Percentage

Probably the smartest stuff I've seen on evaluation of goalkeeper performance, presented by Derrick Yam.

Statistics for Hackers

This PyCon 2016 talk from Jake VanderPlas is a great crash course in doing statistics with for loops. It really provides a great perspective for those of us without an extensive background in hard statistics. Great speaker, too.

Friends of Tracking

This whole series, produced by a handful of soccer analytics experts including David Sumpter, is not-to-miss. It is probably the most comprehensive resource out there for getting started in soccer analytics. And it uses Python!

Resources:

socceraction

A Python library for valuing the individual actions performed by soccer players. Includes an Expected Threat (xT) implementation. From Tom Decroos et al.

statsbombpy

A Python library written by Francisco Goitia to access StatsBomb data.

matplotsoccer

A Python library for visualising soccer event data. Also by Tom Decroos.

ggsoccer

Not Python, but this soccer visualization library from Ben Torvaney is great.

statsbomb-data-parser

A Python library to convert StatsBomb's JSON data into CSV format.

Python Data Science Handbook

Jake VanderPlas made his entire Python Data Science Handbook and accompanying Jupyter notebooks available online. It's a tremendous resource.

Books:

The Numbers Game by Chris Anderson and David Sally
Football Hackers by Christoph Biermann
Soccermatics by David Sumpter

Looking for Ideas?

I maintain a Twitter Thread of potential ideas that I think would be interesting soccer analytics projects.
