expected-goals-thesis

A repository for analysis on Expected Goals using StatsBomb and Wyscout data

Primary language: Jupyter Notebook. License: MIT.


StatsBomb data

This repository assumes that the StatsBomb open-data has already been cloned to a local directory.

To run the notebooks

All of the notebooks can be run from an Anaconda environment.

To install the environment yourself, open the Anaconda Prompt and run the following command from the expected-goals directory:

conda env create -f environment.yml

Activate the new environment from the prompt:

conda activate expected-goals

Load Jupyter Notebook from the prompt:

jupyter notebook

1. Data Loading

The data are from StatsBomb open-data and Wyscout.

To run the analysis, first create the datasets by running the notebooks in the notebooks/create-data directory in numerical order:

  • 00_statsbomb_data_to_parquet.ipynb: creates the StatsBomb data in the data/statsbomb folder
  • 01_wyscout_data_to_parquet.ipynb: creates the Wyscout data in the data/wyscout folder
  • 02_remove_overlap_wyscout_statsbomb.ipynb: removes 100 overlapping games from the Wyscout data
  • 03_wyscout_shot_dataset.ipynb: creates a Wyscout shot dataset: data/wyscout/shots.parquet
  • 04_statsbomb_freeze_frame_features.ipynb: creates some features from the StatsBomb freeze-frame data: data/statsbomb/freeze_features.parquet
  • 05_statsbomb_shot_dataset.ipynb: creates a StatsBomb shot dataset: data/statsbomb/shots.parquet
  • 06_combine_shots_dataset.ipynb: creates an overall shot dataset: data/shots.parquet
  • 07_add_synthetic_shots_remove_outliers.ipynb: removes 227 outliers from data/shots.parquet and generates 1000 fake shots (data/fake_shots.parquet)
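
As a rough illustration of what building a shot dataset from the StatsBomb events involves, here is a minimal sketch. It is not the code from the notebooks: the function name extract_shots and the output columns are hypothetical, and the field names ("type", "location", "shot", "outcome", "freeze_frame") are assumed to follow the StatsBomb open-data event format. Writing the result to parquet is left out.

```python
from __future__ import annotations

from typing import Any


def extract_shots(events: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Flatten StatsBomb-style events into one flat record per shot.

    Hypothetical helper for illustration; the notebooks may do this differently.
    """
    shots = []
    for event in events:
        # Events carry their type as a nested object with a "name" field.
        if event.get("type", {}).get("name") != "Shot":
            continue
        x, y = event["location"][:2]
        shot = event.get("shot", {})
        shots.append({
            "x": x,
            "y": y,
            "goal": shot.get("outcome", {}).get("name") == "Goal",
            # Number of players recorded in the freeze frame (0 if absent).
            "n_freeze_frame_players": len(shot.get("freeze_frame", [])),
        })
    return shots


# Tiny hand-written sample (not real data) in the assumed event layout.
sample_events = [
    {"type": {"name": "Pass"}, "location": [60.0, 40.0]},
    {"type": {"name": "Shot"}, "location": [108.0, 36.0],
     "shot": {"outcome": {"name": "Goal"}, "freeze_frame": [{}, {}]}},
    {"type": {"name": "Shot"}, "location": [100.0, 50.0],
     "shot": {"outcome": {"name": "Saved"}}},
]

for row in extract_shots(sample_events):
    print(row)
```

With real data, the events list would come from loading one of the per-match JSON files in the cloned open-data directory with json.load.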

2. Modelling

  • 00-explore-data-quality-overlap.ipynb: explores the 100 overlapping Wyscout/StatsBomb games and their data quality
  • 01-expected-goals-model.ipynb: builds two expected goals models: logistic regression and light gradient boosting machines
  • 02-expected-goals-calculate-xg-and-shap.ipynb: calculates xG and Shapley values (contributions of the features to the probability of a goal)
  • 03-visualize-models.ipynb: visualizes the models using non-negative matrix factorisation, partial dependence plots, and Shapley values
  • 04-kernel-density-probability-scoring.ipynb: a basic model of shot quality by location using kernel density estimators
  • 05-simulate-match-results-from-xg.ipynb: simulates league tables using expected goals
  • 06-freeze_frame-example.ipynb: a plot of a StatsBomb freeze frame
  • 07-red-zone-heatmap.ipynb: heatmaps for the goal scoring probabilities
  • 08-shots_follow_poisson_distribution.ipynb: a bar chart to show that goals per game can be approximated by a Poisson distribution
  • 09_figure3_angle_features.ipynb: a figure to show how the angles for expected goals models are calculated
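
To give a feel for the angle features mentioned in the last notebook, here is a minimal sketch of two common expected-goals features: the distance to the centre of the goal and the angle subtended by the goalposts at the shot location. This is an assumption about how such features are usually defined, not a copy of the notebook's calculation. It assumes StatsBomb-style pitch coordinates (120 x 80, goal line at x = 120, posts at y = 36 and y = 44); the names shot_distance and shot_angle are hypothetical.

```python
import math

# Assumed StatsBomb-style coordinates: 120 x 80 pitch, attacking towards x = 120.
GOAL_X = 120.0
GOAL_CENTRE_Y = 40.0
POST_LOW_Y, POST_HIGH_Y = 36.0, 44.0


def shot_distance(x: float, y: float) -> float:
    """Straight-line distance from the shot location to the centre of the goal."""
    return math.hypot(GOAL_X - x, GOAL_CENTRE_Y - y)


def shot_angle(x: float, y: float) -> float:
    """Angle (in radians) between the lines from the shot to each goalpost."""
    dx = GOAL_X - x
    to_high_post = math.atan2(POST_HIGH_Y - y, dx)
    to_low_post = math.atan2(POST_LOW_Y - y, dx)
    return to_high_post - to_low_post


# Example: a central shot 12 units from goal versus a shot from a wide position.
central = (108.0, 40.0)
wide = (108.0, 10.0)
print(shot_distance(*central), math.degrees(shot_angle(*central)))
print(shot_distance(*wide), math.degrees(shot_angle(*wide)))
```

The central shot sees a much wider angle than the wide shot taken from the same x coordinate, which is the intuition such features are meant to capture.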