A repository for analysis on Expected Goals using StatsBomb and Wyscout data.
This repository assumes that the StatsBomb open-data has already been cloned to a local directory.
All of the notebooks can be run from an Anaconda environment.
To install the environment yourself use the Anaconda Prompt. Run the following command from the expected-goals directory:
conda env create -f environment.yml
Activate the new environment from the prompt:
conda activate expected-goals
Load Jupyter Notebook from the prompt:
jupyter notebook
The data are from StatsBomb open-data and Wyscout.
To run the analysis you need to first create the datasets. By running the notebooks in the notebooks/create-data directory (in numerical order):
- 00_statsbomb_data_to_parquet.ipynb: creates the StatsBomb data in the data/statsbomb folder
- 01_wyscout_data_to_parquet.ipynb: creates the Wyscout data in the data/wyscout folder
- 02_remove_overlap_wyscout_statsbomb.ipynb: removes 100 overlapping games from the Wyscout data
- 03_wyscout_shot_dataset.ipynb: creates a Wyscout shot dataset: data/wyscout/shots.parquet
- 04_statsbomb_freeze_frame_features.ipynb: creates some features from the StatsBomb freeze-frame data: data/statsbomb/freeze_features.parquet
- 05_statsbomb_shot_dataset.ipynb: creates a StatsBomb shot dataset: data/statsbomb/shots.parquet
- 06_combine_shots_dataset.ipynb: creates an overall shot dataset: data/shots.parquet
- 07_add_synthetic_shots_remove_outliers.ipynb: removes 227 outliers from the data/shots.parquet generates 1000 fake shots (data/fake_shots.parquet)
- 00-explore-data-quality-overlap.ipynb: explores the 100 overlapping Wyscout/StatsBomb games and their data quality
- 01-expected-goals-model.ipynb: builds two expected goals models: logistic regression and light gradient boosting machines
- 02-expected-goals-calculate-xg-and-shap.ipynb: calculates xG and shapely values (contributions of the features to the probability of a goal)
- 03-visualize-models.ipynb: visualize the model using non-negative matrix factorisation, partial dependence plots, and Shapely values
- 04-kernel-density-probability-scoring.ipynb: a basic model of shot quality by location using kernel density estimators
- 05-simulate-match-results-from-xg.ipynb: simulate league tables using expected goals
- 06-freeze_frame-example.ipynb: a plot a StatsBomb freeze frame
- 07-red-zone-heatmap.ipynb: heatmaps for the goal scoring probabilities
- 08-shots_follow_poisson_distribution.ipynb: a bar chart to show that goals per game can be approximated by a Poisson distribution
- 09_figure3_angle_features.ipynb: a figure to show how the angles for expected goals models are calculated