This football analytics package is based on a series of Towards Data Science articles by Ofir Magdaci:
- Embedding the Language of Football Using NLP
- A Deep Dive into the Language of Football
- Data-Driven Evaluation of Football Players' Skills
Using this package, you can download pre-trained models, or run the pipeline and create your own.
The package contains an easy-to-use config file to quickly set up an Anaconda environment for the project. Alternatively, you can manually install the packages listed in that file.
To create a conda environment with all required dependencies (see `conda_env.yml`), open a terminal/cmd window, `cd` to the project repository, and run: `conda env create -f conda_env.yml`.
The StatsBomb open dataset is free data available for research. You can download it and read more about it here.
Save the extracted 'statsbomb' directory into `<package_path>/data`, without modifying its name.
To use the UI, you must first create all necessary artifacts: data objects and trained models. To this end, run `main.py` manually (see 'Manual Run' below). Alternatively, pre-trained models are available for download (see below).
Data objects currently are not available for download due to licensing.
You can get pre-made models by downloading the pre-trained package. Extract it and put the `models_artifacts` folder into the `football2vec/artifacts` directory.
This package includes all pre-trained models.
Due to StatsBomb licensing, I can't offer pre-made data processing artifacts (see 'build_data_objects' under 'Manual Run').
The open-source models of Football2Vec are more basic than those presented in the article. However, they can be easily and freely extended.
- The pre-trained version uses only the action position and type for building tokens.
- For extending word string representation, see 'extending the model' below.
For a manual run, simply execute `main.py` via any Python-supporting IDE you have, or directly from the terminal. For a terminal run, open a terminal/cmd window, `cd` to the project directory, and run `python main.py`.
The package has two main processes:
`build_data_objects()` - builds all data objects required for the models and the UI:
- enriched_events_data: Builds an enriched events_data DataFrame. It applies to_metric_centered_coordinates on the data, adds shot types, etc.
- matches_metadata: Adds season_name, competition_name, etc., for each match in the dataset.
- teams_metadata: Adds columns such as nation, stadium, gender, etc., for each team in the dataset.
- players_metadata: Combines the players metadata given in the dataset and enriches it with events_data information: adds player_name, team_name, and position_name per player (takes the most frequent).
- players_metrics_df: Builds a DataFrame of stats for players - xG, xA, lifts for each shot type, etc.
- baselines.pickle: Builds a dictionary {baselines_dimension: df} where each df is identical in format to players_metrics_df, with baseline names instead of players. For example, the leagues baseline has a DataFrame where each line corresponds to the average stats of the players in that league.
`build_language_models()` - builds all models of Football2Vec, Action2Vec and Player2Vec, and exports their artifacts.
There are some basic configurations for each run, which can be modified directly in `main.py`:
- `force_create = False`: Whether to force overriding all artifacts, without trying to load existing ones.
- `verbose = False`: Controls prints.
- `plotly_export = False`: Whether to export Plotly figures to Plotly studio.
- `save_artifacts = False`: Whether to save the artifacts in `params.PATHS.ARTIFACTS`. Pay attention that this is False by default, meaning NO ARTIFACTS WILL BE SAVED.
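As an illustration, a typical run block in `main.py` might look like the sketch below. The flag names come from this README; the exact structure of `main.py` may differ, and the two build functions are assumed to be defined in that file.

```python
# Illustrative run configuration; the actual main.py may organize this differently.
force_create = False    # do not override artifacts that already exist
verbose = True          # print progress information
plotly_export = False   # keep figures local, do not push to Plotly studio
save_artifacts = True   # required if you want artifacts written to params.PATHS.ARTIFACTS

# The two main processes described above
build_data_objects()
build_language_models()
```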
`params.py`:
- `CONSTANTS.VOCABULARY` - holds all event types that will be considered by the language models.
- `CONSTANTS.HARD_XG` & `CONSTANTS.EASY_XG` - define the 'easy' and 'hard' probability thresholds for skill evaluation and the UI.
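For example, these constants can be tweaked directly in `params.py`. The values below are hypothetical placeholders for illustration, not the package defaults; inspect the file for the actual vocabulary and thresholds.

```python
# params.py (hypothetical values for illustration only)
class CONSTANTS:
    # Event types fed to the language models; see the file for the full list
    VOCABULARY = ['pass', 'shot', 'dribble', 'carry', 'pressure']
    HARD_XG = 0.05   # shots at or below this xG count as 'hard'
    EASY_XG = 0.30   # shots at or above this xG count as 'easy'
```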
Total run time: 106 minutes
- Total run time for `build_data_objects()`: 76 minutes
- Total run time for `build_language_models()`: 30 minutes

With a much older MacBook Pro (Retina, 15-inch, Late 2013), machine processor i386:
- Total run time: 437 minutes
- Total run time for `build_data_objects()`: 108 minutes
- Total run time for `build_language_models()`: 329 minutes
To run the Streamlit UI, open a terminal/cmd window in the project directory and run: `streamlit run player_app.py`.
This will open the app on localhost in your browser. More on deploying Streamlit apps can be found here.
Since the UI consumes all the artifacts above, or creates them on the fly, it is highly recommended either to download the pre-trained models or to run `main.py` before running the UI. When doing so, verify that `save_artifacts` is set to `True`. The UI is a Streamlit dashboard which presents skill evaluation and a representation of the selected player.
During the first run (or any run, if `save_artifacts` is disabled), the app will create `players_metrics_by_seasons.csv`, with a size of 142KB.
- It is recommended to run steps 1 and 2 before launching the UI, so the UI won't build all data objects and models on the fly.
- For best performance, enable `save_artifacts` (see the 'Run' section above). Streamlit will then be able to load the data into its cache, allowing a seamless experience.
It is a simple Streamlit app with the following features:
- Information section: Sidebar for team & player selection, the player image, and player metadata.
- Player skill analysis section:
  - Analysis' parameters control panel.
  - Skills radar chart with baselines.
  - Badges - each shot type has a unique icon for players with a Lift value greater than the threshold [=1.1].
- Player evolution section (collapsible container): Analyzes the player's skills and performance over seasons. Contains two Plotly animated charts.
- xG evaluation (collapsible container): The xG evaluation section presents two charts.
- Player2Vec embeddings section: This section holds all insights originating from the language models listed below.
  - Player2Vec UMAP embeddings plot with coloring and presentation configurations.
  - Most similar players to the selected player, by cosine similarity as well as by Euclidean distance.
A Gensim Word2Vec model which allows embedding the semantics of the football language in a 32-dimensional space.
Read more: Embedding the Language of Football Using NLP.
UMAP projections of the complete 19K-word Action2Vec vocabulary.
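A minimal sketch for exploring the trained vocabulary with plain Gensim, assuming the word-vectors artifact sits in the default `models_artifacts` location (adjust the path to your setup):

```python
from gensim.models import KeyedVectors

# Load the exported Action2Vec word vectors (path assumes the default artifacts layout)
wv = KeyedVectors.load("artifacts/models_artifacts/Action2Vec.wordvectors")

# Pick any token from the ~19K-word vocabulary and inspect its nearest actions
some_action = wv.index_to_key[0]
print(some_action, wv.most_similar(some_action, topn=5))
```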
For extending the model, you may edit the following functions:
- `data_processing.py -> FootballTokenizer -> def tokenize_action`: This function receives an event and produces a word out of it (see the sketch below).
- `data_processing.py -> FootballTokenizer -> def build_corpus`: This function controls the building process of the corpus.
- Model hyper-parameters: `models.py -> def train_Word2Vec`.
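For instance, a hypothetical extension could subclass `FootballTokenizer` and enrich the word string. The field name used below (`body_part`) and the event interface are illustrative assumptions; match them to the actual StatsBomb event schema and the base implementation before adapting.

```python
from data_processing import FootballTokenizer

class RichFootballTokenizer(FootballTokenizer):
    """Hypothetical tokenizer that appends the body part to each action word."""

    def tokenize_action(self, event):
        # Reuse the original position + type word, then append extra context
        word = super().tokenize_action(event)
        body_part = event.get('body_part')   # assumes dict-like event records
        return f"{word}|{body_part}" if body_part else word
```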
PlayerMatch2Vec, a Gensim Doc2Vec model that produces 32-dimensional vectors representing a player within a specific match.
The Player2Vec representation is achieved by simply averaging all of a player's PlayerMatch2Vec representations.
Read more: Embedding the Language of Football Using NLP
Here is how it looks:
Plotly interactive UMAP projection of Player2vec where all player’s matches are averaged to a single vector. Players are colored by position.
Interactive Plotly visualization.
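A minimal sketch of this averaging idea, assuming the Doc2Vec document tags encode (player, match) pairs with the player name as a prefix; the actual tag format used by the package may differ.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec

# Load the trained PlayerMatch2Vec model (path assumes the default artifacts layout)
model = Doc2Vec.load("artifacts/models_artifacts/Player2Vec.model")

def player_vector(player_name: str) -> np.ndarray:
    """Average all PlayerMatch2Vec vectors whose tag starts with the player's name."""
    match_vectors = [model.dv[tag] for tag in model.dv.index_to_key
                     if tag.startswith(player_name)]
    return np.mean(match_vectors, axis=0)   # Player2Vec = mean of per-match vectors
```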
For extending the model, you may edit the following functions:
- `data_processing.py -> FootballTokenizer -> def tokenize_action`: This function receives an event and produces a word out of it.
- `data_processing.py -> FootballTokenizer -> def build_corpus`: This function controls the building process of the corpus.
- Model hyper-parameters: `models.py -> def train_Doc2Vec` (see the sketch below).
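The snippet below illustrates the kind of Gensim hyper-parameters `train_Doc2Vec` is likely to expose, trained on a toy stand-in corpus; the package's actual defaults and corpus construction differ.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document is one player-match "sentence" of action words
corpus = [
    TaggedDocument(words=['pass|left_flank', 'carry|center', 'shot|box'], tags=['player_a|match_1']),
    TaggedDocument(words=['tackle|center', 'pass|right_flank'], tags=['player_b|match_1']),
]

model = Doc2Vec(vector_size=32, window=5, min_count=1, epochs=20, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

print(model.dv['player_a|match_1'][:5])   # 32-dimensional PlayerMatch2Vec vector
```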
NOTICE: this module, unlike the others, is mostly hard-coded and very strict in its input support.
It requires full compliance with the pre-trained models' format and vocabulary naming conventions.
These requirements are strict: when broken, no meaningful outputs will be produced, and errors will be raised.
As demonstrated in A Deep Dive into the Language of Football, this package includes four explainability methods, both local and global: representation-based explainers, analogies, similarities, and creating players' variations:
- ActionAnalogies: an object that allows action analogies using analogy equations: Word A1 → Word A2 ~ Word B1 → Word B2. Read more here.
An example for pass direction analogy:
- PlayersAnalogies: an object that allows player analogies using analogy equations: Word A1 → Word A2 ~ Word B1 → Word B2. Read more here.
Examples:
- PlayerSkillsExplainer: an object that allows combining players with actions, generating endless local variations for a player, across one or more skills. For example, creating offensive variations with more shots or crosses, or enhancing defensive skills by replacing bad tackles with successful ones. These variations can serve as explainers. Read more here and interact with the full Plotly chart here.
- LinearDocExplainer: an object that allows summing collections of actions and player representations, creating player variations to serve as explainers. Read more here.
Example:
- Most similar player to Neymar: Ronaldinho
- Neymar — dribbling (all locations) ~ Thierry Henry (in Barcelona)
- Neymar — flank dribbling ~ Philippe Coutinho
- Most similar player to Griezmann: Carlos Vela
- Griezmann + dribble (all locations) ~ Arjen Robben
- Griezmann + flank dribble ~ Mikel Oyarzabal
- `Player2Vec_std_analysis` - a function that allows analyzing Player2Vec variance, plotting it using Plotly. Read more here and interact with the full Plotly figure here.
- `analyze_vector_dimensions_semantics` - a function that analyzes each dimension of the representation by returning the players with the highest and lowest values in that dimension. Read more here.
The `default_run` method is defined for all explainers in `explainers.py`. It runs the explainer, executing all the analogies and actions shown here. You can use it as a convenient benchmark.
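Under the hood, these analogies boil down to vector arithmetic. The sketch below reproduces the A1 → A2 ~ B1 → B2 pattern with plain Gensim, independently of the ActionAnalogies/PlayersAnalogies wrappers; the token names are placeholders, so use words that actually exist in your trained vocabulary.

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load("artifacts/models_artifacts/Action2Vec.wordvectors")

# A1 -> A2 ~ B1 -> ?  ==>  candidates for B2 are closest to (A2 - A1 + B1)
a1, a2, b1 = "<A1 token>", "<A2 token>", "<B1 token>"
print(wv.most_similar(positive=[a2, b1], negative=[a1], topn=3))
```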
The package outputs various artifacts, both data-related and models-related.
You can enable or disable saving artifacts using the `save_artifacts` configuration mentioned above.
You can control all paths in this package using the `params.py` module.
Model artifacts naming is dynamic, according to the model name. Hence, to change the model artifacts path, change `params.py -> MODELS_ARTIFACTS`.
- The artifacts directory can be modified via `params.py -> ARTIFACTS`.
- The models' artifacts directory can be modified via `params.py -> MODELS_ARTIFACTS`. It contains the following objects:
  - Word2Vec / Doc2Vec model object: `<model_name>.model`
  - Word2Vec model's wordvectors object: `<model_name>.wordvectors` (read more here and here).
  - Corpus object, in which all words and sentences are processed and their mappings are saved: `<model_name>_corpus.pickle`
  - Embeddings dictionary object, in which all words and documents vectors are saved: `<model_name>_embeddings.pickle`
  - Similarity db object, which keeps all cosine similarity values across documents: `<model_name>_docs_similarities.pickle`
  - UMAP figure as HTML, created by `models -> plot_embeddings`: `<model_name>_umap_plot.html`
- Pay attention: `MODELS_ARTIFACTS` includes the `ARTIFACTS` path in it.
- It is recommended to modify `ARTIFACTS` and `MODELS_ARTIFACTS` rather than the following paths.
- Events data outputs:
  - Path of all processed enriched events data: `params.py -> ENRICH_PLAYERS_METADATA_PATH`
- Metadata and metrics paths:
  - Path of players metadata: `params.py -> PATHS.PLAYERS_METADATA_PATH`
  - Path of teams metadata: `params.py -> PATHS.TEAMS_METADATA_PATH`
  - Path of matches metadata: `params.py -> PATHS.MATCHES_METADATA_PATH`
  - Path of players skill evaluation metrics: `params.py -> PATHS.PLAYERS_METRICS_PATH`
  - Path of baselines skill evaluation metrics: `params.py -> PATHS.BASELINE_PLAYERS_METRICS_PATH`
  - Path of skill evaluation metrics by season: `params.py -> PATHS.PLAYERS_METRICS_BY_SEASON`
- Analyses paths:
  - Path of explainers' outputs: `params.py -> EXPLAINERS`
  - Path of skill analysis: `params.py -> PATHS.EXPLAINERS`
This includes the following files:
- matches_metadata.csv - 434KB
- players_metadata.csv - 1.4MB
- players_metrics_df.csv - 2.5MB
- baselines_metrics.pickle - 20KB
- enriched_events_data.pickle - 1.65GB
These include the following files:
- Action2Vec:
  - `Action2Vec.model` (Gensim Word2Vec object) - 257KB (3.9MB for the pre-trained)
  - `Action2Vec.wordvectors` (Word2Vec wordvectors object) - 138KB (2.2MB for the pre-trained)
  - `Action2Vec_corpus.pickle` - 8MB (14.3MB for the pre-trained)
  - Plotly HTML file `Action2Vec_umap_plot.html` - 4.4MB
- Player2Vec:
  - `Player2Vec.model` (Gensim Doc2Vec object; in fact, it is PlayerMatch2Vec) - 5.1MB (13.7MB for the pre-trained)
  - `Player2Vec.wordvectors` (Doc2Vec wordvectors object) - 4.7MB
  - `Player2Vec_embeddings.pickle` - 1MB
  - `Player2Vec_corpus.pickle` - 8.4MB (13.6MB for the pre-trained)
  - Plotly HTML file `Player2Vec_umap_plot.html` - ~5MB
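If you want to inspect these artifacts programmatically, a sketch like the one below can help; the exact structure of the pickled objects is package-specific, so check what was saved before relying on a particular layout.

```python
import pickle

with open("artifacts/models_artifacts/Player2Vec_embeddings.pickle", "rb") as f:
    embeddings = pickle.load(f)

# Inspect the saved object before assuming a particular structure
print(type(embeddings))
if isinstance(embeddings, dict):
    print(len(embeddings), list(embeddings)[:5])
```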
These include the following files:
- `explain.py`:
  - `Player2Vec Variance_umap_plot.html` (in the `MODELS_ARTIFACTS` directory).
  - `PlayersAnalogies` object outputs players analogies results (if the `export_artifacts` argument is sent as `True`). It produces a CSV file for each analogy.
    Naming format: `Analogy/<analogy name>/ <A1> - <A2> + <B2> ~ ?.csv`
  - `PlayerSkillsExplainer` outputs a CSV with the most similar results for each given query.
    Naming format: `most_similar_<player_name>_<variation_action>_<skill_name>.csv`
  - A Plotly UMAP projection figure will be opened via the browser for each given query.
- `skill_analysis.py`: no artifacts. Plotly figures are opened in the browser.
- `players_metrics_by_seasons.csv`: a DataFrame of the `players_metrics_df` metrics, also aggregated by season, for the evolution plots.
- `team_2_players.pickle`: a dict for fast access to all players of each team.
In order to allow export to Plotly studio, please fill `PLOTLY_USERNAME` and `PLOTLY_API_KEY` in `params.py`.
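For example (placeholder values only):

```python
# params.py — fill in your own Plotly studio credentials
PLOTLY_USERNAME = "<your-plotly-username>"
PLOTLY_API_KEY = "<your-plotly-api-key>"
```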