
Soccer Analytics Handbook

Devin Pleuler — April 2020

This is probably overdue. I believe that people who have managed to wiggle themselves into dream jobs have a responsibility to help others get there too. This was written during the depths of social isolation imposed by the COVID-19 pandemic. During this period, I've had an atypically large number of students and career changers reach out to me with questions, and a little extra free time on my hands, so I'm finally completing my assigned homework.

There are plenty of resources that cover the "how do I get a job in sports analytics" career-strategy questions, like THIS and THIS from Sam Gregory. This handbook is geared more toward the technical skills, concepts, and sports analytics history that I think are worth familiarizing yourself with.

In the handbook you can find three primary things:

  1. Resources and suggestions for technical skills worth having for work in soccer analytics (which can probably be extended to other sports).
  2. A series of tutorials delivered in Jupyter notebook format using StatsBomb Open Data, covering various data science techniques common in soccer analytics.
  3. Collected research and articles that I believe are required reading to get up to speed with both the history and state-of-the-art in soccer analytics.

Live long and prosper. 🖖🏻


Where to start?

The most important attributes for contributing to the soccer analytics landscape are a deep knowledge of the game, an ability to communicate clearly and effectively, and a bucket load of skepticism. Unfortunately, getting a job in soccer analytics is largely independent of these attributes and mostly depends on good fortune and timing. It is not a meritocracy, and I hope that changes.

But the most important technical skill once you have landed a job in soccer analytics is experience with scripting languages, preferably Python or R as they're great for data science.

I personally prefer Python, and therefore my recommendations will be geared in that direction. My primary reasons for this suggestion are:

  • Simple syntax makes it great for first-time programmers
  • Excellent documentation and community support
  • Most analytics departments are using it
  • Plays nicely with others
  • It's magic

There are a ton of great resources online for learning Python, so I'm not going to reinvent the wheel here. Here are some that look good:

Note: Starting with Python 3 (I'd suggest version 3.7+) is probably the best route at this point. Python 2 is in the painful process of being put out to pasture.


Important Python Libraries

From a data science perspective, you can do just about anything worthwhile with the SciPy Stack. All of its libraries are well-supported and easily googleable if you run into issues. As a beginner, I wouldn't stray too far from these foundational libraries. If you do, you should have a decent reason for it. Some of its important components include:

  • NumPy - A fundamental library for scientific computing in Python. Particularly great for optimized vector and matrix calculations.
  • Pandas - A fast data analysis and manipulation library. Its DataFrame functionality is super useful (and reminiscent of some of the good bits of R).
  • Matplotlib - The de facto Python plotting library. It's finicky, but powerful. I've learned to love it.

I'd also suggest scikit-learn (a.k.a. sklearn), which I find very user-friendly and is built on top of the libraries mentioned above. In our tutorials, we will predominantly use the sklearn implementations.
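
To give a flavor of how these pieces fit together, here's a tiny, self-contained sketch (the pass numbers are entirely made up, purely for illustration): build a small DataFrame with NumPy and Pandas, fit a quick scikit-learn model, and plot it with Matplotlib.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic example: pass volume vs. completion rate for 100 imaginary players
rng = np.random.default_rng(0)
passes = rng.integers(10, 90, size=100)
completion = 0.70 + 0.002 * passes + rng.normal(0, 0.05, size=100)

df = pd.DataFrame({"passes": passes, "completion": completion}).sort_values("passes")

# Fit a simple linear model with scikit-learn
model = LinearRegression().fit(df[["passes"]], df["completion"])
df["fit"] = model.predict(df[["passes"]])

# Plot the raw points and the fitted line with Matplotlib
plt.scatter(df["passes"], df["completion"], alpha=0.5, label="players (synthetic)")
plt.plot(df["passes"], df["fit"], color="red", label="linear fit")
plt.xlabel("Passes attempted")
plt.ylabel("Completion rate")
plt.legend()
plt.show()
```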


Where can I get data?

First, it's worth explaining what varieties of soccer performance data exist in the wild. Typically, and colloquially, there are two types of data: Event Data and Tracking Data.

Event Data is effectively a chronological, event-by-event tabulation of on-ball actions. It's typically collected from broadcast footage by third-party collectors and sold on the open market to clubs, broadcasters, the gambling industry, and even private individuals. The primary companies competing in this space are Opta (now owned by STATS Perform) and StatsBomb, but there are other competitors.

Tracking Data is an entirely different beast. Player tracking systems record the coordinate position of every player on the field (and usually the ball), many times per second. State-of-the-art systems collect up to 25 samples-per-second. Because these systems are expensive to install and operate, and require in-stadium hardware, this data is mostly available to the clubs themselves, but academics frequently get their hands on this data in a highly anonymized format through tediously painful research agreements. There are various competitors in this space, such as ChyronHego, Second Spectrum, STATS Perform, Metrica, Signality, and others.

The difference in scale between the two data types is enormous. A single game of Event Data contains roughly 2-3 thousand individual events, while a single game of Tracking Data represents more than 2 million individual measurements.
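
To make that scale concrete, here's a rough back-of-the-envelope calculation, along with a toy example of the kind of per-frame computation tracking data allows. It assumes a 25 Hz feed, 22 players plus the ball, and roughly 95 minutes of play; the coordinates are invented.

```python
import numpy as np

# Back-of-the-envelope scale: 25 samples-per-second, 22 players + ball, ~95 minutes
frames = 25 * 95 * 60
objects = 23
print(f"{frames * objects:,} position samples per match (each an x, y pair)")

# Toy per-frame computation: a player's speed from consecutive (x, y) samples in meters
xy = np.array([[50.0, 30.0], [50.2, 30.1], [50.5, 30.3], [50.9, 30.6]])
step = np.linalg.norm(np.diff(xy, axis=0), axis=1)  # distance covered between frames
print(step * 25)  # meters per second at 25 Hz
```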

StatsBomb has provided a large volume of data "freely available for public use" via their Open Data repository on GitHub in order to better serve the analytics community. We will be using this data in some of the tutorials below.
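
If you'd like to poke at the Open Data before opening the notebooks, something like the snippet below works. It assumes the repository still serves raw JSON from the data/competitions.json, data/matches/, and data/events/ paths on the master branch; adjust if the layout has changed.

```python
import pandas as pd
import requests

BASE = "https://raw.githubusercontent.com/statsbomb/open-data/master/data"

# List the competitions included in the open data release
competitions = pd.DataFrame(requests.get(f"{BASE}/competitions.json").json())
print(competitions[["competition_id", "season_id", "competition_name", "season_name"]].head())

# Pick a competition/season and list its matches
comp_id = int(competitions.iloc[0]["competition_id"])
season_id = int(competitions.iloc[0]["season_id"])
matches = pd.DataFrame(requests.get(f"{BASE}/matches/{comp_id}/{season_id}.json").json())

# Flatten the raw events for one match into a DataFrame
match_id = matches.iloc[0]["match_id"]
events = pd.json_normalize(requests.get(f"{BASE}/events/{match_id}.json").json())
print(events[["type.name", "team.name", "minute"]].head())
```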

Since the creation of this document, there have been some exciting developments in the world of publicly available data. Metrica has released two matches of tracking data, which are, to my knowledge, the first examples of publicly available tracking data. This is a huge contribution to the soccer analytics community, and I plan on contributing some examples of how to best use tracking data.


Jupyter Notebooks!

How could I go so long without mentioning Jupyter? That's because it deserves its own section. I only discovered Jupyter a year or two ago, and I've become a much stronger analyst because of it.

Jupyter Notebooks are easily shareable documents that contain executable Python code alongside human-readable text for annotative purposes. They're perfect for sharing code and demonstrating concepts. We will be using them to deliver the tutorials below.

The notebooks will be hosted on Google Colab, which allows you to write, run, and share Jupyter notebooks within your Google Drive. For free!

If you're unfamiliar with Google Colab (or Jupyter), check out this introduction video.


Soccer Analytics Tutorials

1. Data Extraction & Transformation

Open In Colab

Parsing raw StatsBomb data and storing it in a Pandas DataFrame

2. Linear Regression

Open In Colab

Examining the relationship between a player's pass volume and completion percentage

3. Logistic Regression

Open In Colab

Predicting the outcome of a shot given its features
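
As a taste of the idea (not the notebook itself), here's a bare-bones sketch with made-up shot features: distance and angle to goal, fed into scikit-learn's logistic regression. The tutorial works with real StatsBomb shots instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic shots: distance to goal (m) and angle to goal (radians); 1 = goal
rng = np.random.default_rng(1)
distance = rng.uniform(5, 30, size=500)
angle = rng.uniform(0.2, 1.2, size=500)
p_goal = 1 / (1 + np.exp(-(2.0 - 0.15 * distance + 1.0 * angle)))  # toy "true" model
goal = rng.binomial(1, p_goal)

X = np.column_stack([distance, angle])
model = LogisticRegression().fit(X, goal)

# Predicted scoring probability (a crude xG) for a 12m shot at a 0.6-radian angle
print(model.predict_proba([[12.0, 0.6]])[0, 1])
```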

4. Clustering

Open In Colab

Identifying different types of passes
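
A stripped-down version of the same idea, using k-means from scikit-learn on synthetic pass coordinates (the notebook uses real StatsBomb passes; the 120 x 80 pitch dimensions follow StatsBomb's coordinate system).

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic passes described by start and end coordinates on a 120 x 80 pitch
rng = np.random.default_rng(2)
start = rng.uniform([0, 0], [120, 80], size=(1000, 2))
end = np.clip(start + rng.normal(0, 15, size=(1000, 2)), [0, 0], [120, 80])
passes = np.hstack([start, end])  # each row: x0, y0, x1, y1

# Group the passes into a handful of "types" by their geometry
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(passes)
print(kmeans.cluster_centers_.round(1))  # a prototypical pass for each cluster
```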

5. Database Population & Querying

Open In Colab

Using Pandas & SQLAlchemy to store and retrieve StatsBomb event data

7. Data Visualization

Open In Colab

Create a Passing Network from the 2018 Men's World Cup Final

8. Non-Negative Matrix Factorization

Open In Colab

Using NNMF to uncover spatial components of individual player contribution.

9. Pitch Dominance

Open In Colab

Loading and displaying the Metrica sample tracking data, and building a basic pitch dominance model.

10. XGBoost Classification

Coming soon

11. Convolutional Neural Network

Coming soon


What other skills are worth picking up?

After the programming side, my suggestion is to earn some experience with relational databases. In particular, I think MySQL or PostgreSQL are great places to start. Like the rest of my recommendations, they're both open source. I mention that here because you can find a ton of enterprise solutions in this area.

Understanding SQL, which has various dialects (but you really only need to know one to adequately Google the quirks between them), is important for efficiently fetching data before processing it. At some clubs, for a hire that is coming into an already functioning analytics department, this skill is possibly the most important.

I use sqlalchemy a lot; it has a bit of a learning curve, but I've found it tremendously useful for bridging the gap between Python and SQL. And it's super cross-platform.
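
Here's a minimal sketch of the Pandas + sqlalchemy round trip. It uses a throwaway SQLite file so it runs anywhere; the table and columns are invented, and in practice you'd point the connection string at MySQL or PostgreSQL.

```python
import pandas as pd
from sqlalchemy import create_engine

# SQLite keeps the example self-contained; a real setup would use something like
# "postgresql://user:password@host/dbname"
engine = create_engine("sqlite:///events.db")

# A hypothetical mini event table
events = pd.DataFrame({
    "match_id": [1, 1, 2],
    "type": ["Pass", "Shot", "Pass"],
    "minute": [3, 27, 55],
})

# Write the DataFrame to a SQL table, then pull a filtered slice back out with SQL
events.to_sql("events", engine, if_exists="replace", index=False)
passes = pd.read_sql("SELECT * FROM events WHERE type = 'Pass'", engine)
print(passes)
```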

Don't forget Excel. It's possibly the most important piece of software ever built. Nobody is too good for Excel.

Having some data-visualization experience in your toolkit is also valuable. After Matplotlib, I would recommend:

  • D3.js is highly recommended for those with even a bit of Javascript and web development experience. The learning curve is totally worth it.
  • Altair, for those making the transition over from R who really miss ggplot, one of R's most redeeming qualities.
  • Seaborn is a nice visualization library built on top of Matplotlib (a short sketch follows this list).
  • Tableau is totally fine. The tradeoff between customizability and ease of use is worth it in plenty of situations. Don't be a hero.
  • Don't forget how powerful conditional formatting is in Excel.
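
To show how little code a Seaborn chart takes once your data is in a DataFrame, here's a small sketch with invented team-level numbers:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Invented per-team shot and goal counts, purely for illustration
df = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "shots": [14, 9, 17, 11],
    "goals": [2, 1, 3, 1],
})

sns.scatterplot(data=df, x="shots", y="goals", hue="team")
plt.title("Shots vs. goals (synthetic)")
plt.show()
```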

Knowing some basic version control is really important for working effectively on a data science or analytics team. Git (and GitHub) is the easy recommendation here. Also, code testing is a thing, unbeknownst to a majority of my code. I'd suggest using nose.
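
If you've never written a test before, here's what a tiny test module looks like; the helper function is hypothetical, and both nose and pytest will discover and run any function whose name starts with test_.

```python
# test_metrics.py -- a hypothetical helper plus a test that nose/pytest will pick up

def pass_completion(completed, attempted):
    """Completion rate, guarding against division by zero."""
    return 0.0 if attempted == 0 else completed / attempted

def test_pass_completion():
    assert pass_completion(8, 10) == 0.8
    assert pass_completion(0, 0) == 0.0
```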

It's probably worth adding a note about IDEs (integrated development environments) here for the sake of completeness (i.e. what you write your code in). I've raved about Jupyter notebooks above, but they aren't great for larger software projects.

Personally, I enjoy using Atom (made by GitHub) because I'm apparently a glutton for punishment. A lot of people swear by PyCharm, and others love VS Code. They're all fine. It's also smart to get familiar with vim or emacs, and general bash commands. Survival skills in the command-line environment are important when you start getting into data engineering stuff.

When you eventually reach a place where you might want to put some of your analytics stuff online, but don't want to leave Python, I'd suggest using one of these web frameworks:

  • Flask is an awesome lightweight framework that lets you prototype stuff easily and quickly. Great for building APIs (see the small sketch after this list).
  • Django is a fully-featured framework that is a bit harder to use, but does a lot of the hard stuff for you. Its ORM is quite similar to sqlalchemy, which is a plus.
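
To show how little Flask needs to serve something useful, here's a minimal sketch: a single JSON endpoint that could sit behind a simple analytics dashboard. The route, payload, and team code are all invented for illustration.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical in-memory "database" of team-level metrics
TEAM_METRICS = {"TOR": {"xg_for": 1.4, "xg_against": 1.1}}

@app.route("/api/teams/<team_id>/metrics")
def team_metrics(team_id):
    metrics = TEAM_METRICS.get(team_id)
    if metrics is None:
        return jsonify({"error": "unknown team"}), 404
    return jsonify(metrics)

if __name__ == "__main__":
    app.run(debug=True)  # development server only
```

Run it and hit /api/teams/TOR/metrics in a browser to see the JSON response.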

And for the more-experienced analytics enthusiasts, I'd suggest picking up some of these:

There are lots of different ways to install both Python and all these different packages. The easiest way to get up and running on your local machine is probably Anaconda. I also suggest learning how to use pip and virtual environments.


Soccer Analytics Research:

They've created a Python library from this research; you can find it in the Resources section below.


Some Favorite Blog Posts

Many of these are borrowed from Sam Gregory's list here. This is far from complete, and I will definitely add to it from time to time.


Recommended Watching:

Resources:
Books:
Looking for Ideas?

I maintain a Twitter Thread of potential ideas that I think would be interesting soccer analytics projects.