This football analytics package is based on a series of Towards Data Science articles by Ofir Magdaci:
- Embedding the Language of Football Using NLP
- A Deep Dive into the Language of Football
- Data-Driven Evaluation of Football Players' Skills
Using this package, you can download pre-trained models, or run the pipeline and create your own.
The package contains an easy-to-use config file to quickly set up an Anaconda environment for the project. Alternatively, you can manually install the packages listed in that file.
To create a conda environment with all required dependencies (see `conda_env.yml`), open a terminal/cmd window, `cd` to the project repository, and run: `conda env create -f conda_env.yml`.
The StatsBomb open dataset is free data available for research. You can download it and read more about it here.
Save the extracted 'statsbomb' directory into `<package_path>/data`, without modifying its name.
To use the UI, you must first create all necessary artifacts: data objects and trained models. To this end, run `main.py` manually (see 'Manual Run' below). Alternatively, pre-trained models are available for download (see below).
Data objects currently are not available for download due to licensing.
You can get pre-made models by downloading the pre-trained package. Extract it and put the `models_artifacts` folder into the `football2vec/artifacts` directory.
This package includes all pre-trained models.
Due to StatsBomb licensing, I can't offer pre-made data processing artifacts (see 'build_data_objects' under 'Manual Run').
The open-source models of Football2Vec are more basic than those presented in the article. However, they can be easily and freely extended.
- The pre-trained version uses only the action position and type for building tokens.
- For extending word string representation, see 'extending the model' below.
For a manual run, simply execute `main.py` via any Python-supporting IDE you have, or directly from the terminal. For a terminal run, open a terminal/cmd window, `cd` to the project directory, and run `python main.py`.
The package has two main processes:
`build_data_objects()` - builds all data objects required for the models and the UI:
- enriched_events_data: Builds an enriched events_data DataFrame. It applies to_metric_centered_coordinates on the data, adds shot types, etc.
- matches_metadata: Adds season_name, competition_name, etc., for each match in the dataset.
- teams_metadata: Adds columns such as nation, stadium, gender, etc., for each team in the dataset.
- players_metadata: Combines the players metadata given in the dataset and enriches it with events_data information: adds player_name, team_name, and position_name per player (takes the most frequent).
- players_metrics_df: Builds a DataFrame of stats for players - xG, xA, lifts for each shot type, etc.
- baselines.pickle: Builds a dictionary {baselines_dimension: df} where each df is identical in format to players_metrics_df, with baseline names instead of players. For example, the leagues baseline has a DataFrame where each line corresponds to the average stats of the players in that league.
`build_language_models()` - builds all models of Football2Vec, Action2Vec and Player2Vec, and exports their artifacts.
There are some basic configurations for each run, which can be modified directly in `main.py`:
- `force_create = False`: Whether to force overriding all artifacts, without trying to load existing ones.
- `verbose = False`: Controls prints.
- `plotly_export = False`: Whether to export Plotly figures to Plotly studio.
- `save_artifacts = False`: Whether to save the artifacts in `params.PATHS.ARTIFACTS`. Pay attention that this is False by default, meaning NO ARTIFACTS WILL BE SAVED.
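As an illustration, a typical run block in `main.py` might look like the sketch below. The flag names come from this README; the exact structure of `main.py` may differ, and the two build functions are assumed to be defined in that file.

```python
# Illustrative run configuration; the actual main.py may organize this differently.
force_create = False    # do not override artifacts that already exist
verbose = True          # print progress information
plotly_export = False   # keep figures local, do not push to Plotly studio
save_artifacts = True   # required if you want artifacts written to params.PATHS.ARTIFACTS

# The two main processes described above
build_data_objects()
build_language_models()
```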
`params.py`:
- `CONSTANTS.VOCABULARY` - holds all event types that will be considered by the language models.
- `CONSTANTS.HARD_XG` & `CONSTANTS.EASY_XG` - define the 'easy' and 'hard' probability thresholds for skill evaluation and the UI.
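For example, these constants can be tweaked directly in `params.py`. The values below are hypothetical placeholders for illustration, not the package defaults; inspect the file for the actual vocabulary and thresholds.

```python
# params.py (hypothetical values for illustration only)
class CONSTANTS:
    # Event types fed to the language models; see the file for the full list
    VOCABULARY = ['pass', 'shot', 'dribble', 'carry', 'pressure']
    HARD_XG = 0.05   # shots at or below this xG count as 'hard'
    EASY_XG = 0.30   # shots at or above this xG count as 'easy'
```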
Total run time: 106 minutes
- Total run time for `build_data_objects()`: 76 minutes
- Total run time for `build_language_models()`: 30 minutes

With a much older MacBook Pro (Retina, 15-inch, Late 2013), machine processor i386:
- Total run time: 437 minutes
- Total run time for `build_data_objects()`: 108 minutes
- Total run time for `build_language_models()`: 329 minutes
To run the Streamlit UI, open a terminal/cmd window in the project directory and run: `streamlit run player_app.py`.
This will open the app on localhost in your browser. More on deploying Streamlit apps can be found here.
Since the UI consumes all the artifacts above, or creates them on the fly, it is highly recommended either to download the pre-trained models or to run `main.py` before running the UI. When doing so, verify that `save_artifacts` is set to `True`. The UI is a Streamlit dashboard which presents skill evaluation and a representation of the selected player.
During the first run (or any run, if `save_artifacts` is disabled), the app will create `players_metrics_by_seasons.csv`, with a size of 142KB.
- It is recommended to run steps 1 and 2 before launching the UI, so the UI won't build all data objects and models on the fly.
- For best performance, enable `save_artifacts` (see the 'Run' section above). Streamlit will then be able to load the data into its cache, allowing a seamless experience.
It is a simple Streamlit app with the following features:
- Information section: Sidebar for team & player selection, the player image, and player metadata.
- Player skill analysis section:
  - Analysis' parameters control panel.
  - Skills radar chart with baselines.
  - Badges - each shot type has a unique icon for players with a Lift value greater than the threshold [=1.1].
- Player evolution section (collapsible container): Analyzes the player's skills and performance over seasons. Contains two Plotly animated charts.
- xG evaluation (collapsible container): The xG evaluation section presents two charts.
- Player2Vec embeddings section: This section holds all insights originating from the language models listed below.
  - Player2Vec UMAP embeddings plot with coloring and presentation configurations.
  - Most similar players to the selected player, by cosine similarity as well as by Euclidean distance.
A Gensim Word2Vec model which allows embedding the semantics of the football language in a 32-dimensional space.
Read more: Embedding the Language of Football Using NLP.
UMAP projections of the complete 19K-word Action2Vec vocabulary.
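A minimal sketch for exploring the trained vocabulary with plain Gensim, assuming the word-vectors artifact sits in the default `models_artifacts` location (adjust the path to your setup):

```python
from gensim.models import KeyedVectors

# Load the exported Action2Vec word vectors (path assumes the default artifacts layout)
wv = KeyedVectors.load("artifacts/models_artifacts/Action2Vec.wordvectors")

# Pick any token from the ~19K-word vocabulary and inspect its nearest actions
some_action = wv.index_to_key[0]
print(some_action, wv.most_similar(some_action, topn=5))
```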
For extending the model, you may edit the following functions:
- `data_processing.py -> FootballTokenizer -> def tokenize_action`: This function receives an event and produces a word out of it (see the sketch below).
- `data_processing.py -> FootballTokenizer -> def build_corpus`: This function controls the building process of the corpus.
- Model hyper-parameters: `models.py -> def train_Word2Vec`.
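For instance, a hypothetical extension could subclass `FootballTokenizer` and enrich the word string. The field name used below (`body_part`) and the event interface are illustrative assumptions; match them to the actual StatsBomb event schema and the base implementation before adapting.

```python
from data_processing import FootballTokenizer

class RichFootballTokenizer(FootballTokenizer):
    """Hypothetical tokenizer that appends the body part to each action word."""

    def tokenize_action(self, event):
        # Reuse the original position + type word, then append extra context
        word = super().tokenize_action(event)
        body_part = event.get('body_part')   # assumes dict-like event records
        return f"{word}|{body_part}" if body_part else word
```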
PlayerMatch2Vec, a Gensim Doc2Vec model that produces 32-dimensional vectors representing a player within a specific match.
The Player2Vec representation is achieved by simply averaging all of a player's PlayerMatch2Vec representations.
Read more: Embedding the Language of Football Using NLP
Here is how it looks:
Plotly interactive UMAP projection of Player2vec where all player’s matches are averaged to a single vector. Players are colored by position.
Interactive Plotly visualization.
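A minimal sketch of this averaging idea, assuming the Doc2Vec document tags encode (player, match) pairs with the player name as a prefix; the actual tag format used by the package may differ.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec

# Load the trained PlayerMatch2Vec model (path assumes the default artifacts layout)
model = Doc2Vec.load("artifacts/models_artifacts/Player2Vec.model")

def player_vector(player_name: str) -> np.ndarray:
    """Average all PlayerMatch2Vec vectors whose tag starts with the player's name."""
    match_vectors = [model.dv[tag] for tag in model.dv.index_to_key
                     if tag.startswith(player_name)]
    return np.mean(match_vectors, axis=0)   # Player2Vec = mean of per-match vectors
```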
For extending the model, you may edit the following functions:
- `data_processing.py -> FootballTokenizer -> def tokenize_action`: This function receives an event and produces a word out of it.
- `data_processing.py -> FootballTokenizer -> def build_corpus`: This function controls the building process of the corpus.
- Model hyper-parameters: `models.py -> def train_Doc2Vec` (see the sketch below).
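The snippet below illustrates the kind of Gensim hyper-parameters `train_Doc2Vec` is likely to expose, trained on a toy stand-in corpus; the package's actual defaults and corpus construction differ.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document is one player-match "sentence" of action words
corpus = [
    TaggedDocument(words=['pass|left_flank', 'carry|center', 'shot|box'], tags=['player_a|match_1']),
    TaggedDocument(words=['tackle|center', 'pass|right_flank'], tags=['player_b|match_1']),
]

model = Doc2Vec(vector_size=32, window=5, min_count=1, epochs=20, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

print(model.dv['player_a|match_1'][:5])   # 32-dimensional PlayerMatch2Vec vector
```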
NOTICE: this module, unlike the others, is mostly hard-coded and very strict in its input support.
It requires full compliance with the pre-trained models' format and vocabulary naming conventions.
These requirements are strict: when broken, no meaningful outputs will be produced, and errors will be raised.
As demonstrated in A Deep Dive into the Language of Football, this package includes four explainability methods, both local and global: representation-based explainers, analogies, similarities, and creating players' variations:
- ActionAnalogies: an object that allows action analogies using analogy equations: Word A1 → Word A2 ~ Word B1 → Word B2. Read more here.
An example for pass direction analogy:
- PlayersAnalogies: an object that allows player analogies using analogy equations: Word A1 → Word A2 ~ Word B1 → Word B2. Read more here.
Examples:
- PlayerSkillsExplainer: an object that allows combining players with actions, generating endless local variations for a player, across one or more skills. For example, creating offensive variations with more shots or crosses, or enhancing defensive skills by replacing bad tackles with successful ones. These variations can serve as explainers. Read more here and interact with the full Plotly chart here.
- LinearDocExplainer: an object that allows summing collections of actions and player representations, creating player variations to serve as explainers. Read more here.
Example:
- Most similar player to Neymar: Ronaldinho
- Neymar — dribbling (all locations) ~ Thierry Henry (in Barcelona)
- Neymar — flank dribbling ~ Philippe Coutinho
- Most similar player to Griezmann: Carlos Vela
- Griezmann + dribble (all locations) ~ Arjen Robben
- Griezmann + flank dribble ~ Mikel Oyarzabal
- `Player2Vec_std_analysis` - a function that allows analyzing Player2Vec variance, plotting it using Plotly. Read more here and interact with the full Plotly figure here.
- `analyze_vector_dimensions_semantics` - a function that analyzes each dimension of the representation by returning the players with the highest and lowest values in that dimension. Read more here.
The `default_run` method is defined for all explainers in `explainers.py`. It runs the explainer, executing all the analogies and actions shown here. You can use it as a convenient benchmark.
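Under the hood, these analogies boil down to vector arithmetic. The sketch below reproduces the A1 → A2 ~ B1 → B2 pattern with plain Gensim, independently of the ActionAnalogies/PlayersAnalogies wrappers; the token names are placeholders, so use words that actually exist in your trained vocabulary.

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load("artifacts/models_artifacts/Action2Vec.wordvectors")

# A1 -> A2 ~ B1 -> ?  ==>  candidates for B2 are closest to (A2 - A1 + B1)
a1, a2, b1 = "<A1 token>", "<A2 token>", "<B1 token>"
print(wv.most_similar(positive=[a2, b1], negative=[a1], topn=3))
```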
The package outputs various artifacts, both data-related and models-related.
You can enable or disable saving artifacts using the `save_artifacts` configuration mentioned above.
You can control all paths in this package using the `params.py` module.
Model artifacts naming is dynamic, according to the model name. Hence, to change the model artifacts path, change `params.py -> MODELS_ARTIFACTS`.
- The artifacts directory can be modified via `params.py -> ARTIFACTS`.
- The models' artifacts directory can be modified via `params.py -> MODELS_ARTIFACTS`. It contains the following objects:
  - Word2Vec / Doc2Vec model object: `<model_name>.model`
  - Word2Vec model's wordvectors object: `<model_name>.wordvectors` (read more here and here).
  - Corpus object, in which all words and sentences are processed and their mappings are saved: `<model_name>_corpus.pickle`
  - Embeddings dictionary object, in which all words and documents vectors are saved: `<model_name>_embeddings.pickle`
  - Similarity db object, which keeps all cosine similarity values across documents: `<model_name>_docs_similarities.pickle`
  - UMAP figure as HTML, created by `models -> plot_embeddings`: `<model_name>_umap_plot.html`
- Pay attention: `MODELS_ARTIFACTS` includes the `ARTIFACTS` path in it.
- It is recommended to modify `ARTIFACTS` and `MODELS_ARTIFACTS` rather than the following paths.
- Events data outputs:
  - Path of all processed enriched events data: `params.py -> ENRICH_PLAYERS_METADATA_PATH`
- Metadata and metrics paths:
  - Path of players metadata: `params.py -> PATHS.PLAYERS_METADATA_PATH`
  - Path of teams metadata: `params.py -> PATHS.TEAMS_METADATA_PATH`
  - Path of matches metadata: `params.py -> PATHS.MATCHES_METADATA_PATH`
  - Path of players skill evaluation metrics: `params.py -> PATHS.PLAYERS_METRICS_PATH`
  - Path of baselines skill evaluation metrics: `params.py -> PATHS.BASELINE_PLAYERS_METRICS_PATH`
  - Path of skill evaluation metrics by season: `params.py -> PATHS.PLAYERS_METRICS_BY_SEASON`
- Analyses paths:
  - Path of explainers' outputs: `params.py -> EXPLAINERS`
  - Path of skill analysis: `params.py -> PATHS.EXPLAINERS`
This includes the following files:
- matches_metadata.csv - 434KB
- players_metadata.csv - 1.4MB
- players_metrics_df.csv - 2.5MB
- baselines_metrics.pickle - 20KB
- enriched_events_data.pickle - 1.65GB
These include the following files:
- Action2Vec:
  - `Action2Vec.model` (Gensim Word2Vec object) - 257KB (3.9MB for the pre-trained)
  - `Action2Vec.wordvectors` (Word2Vec wordvectors object) - 138KB (2.2MB for the pre-trained)
  - `Action2Vec_corpus.pickle` - 8MB (14.3MB for the pre-trained)
  - Plotly HTML file `Action2Vec_umap_plot.html` - 4.4MB
- Player2Vec:
  - `Player2Vec.model` (Gensim Doc2Vec object; in fact, it is PlayerMatch2Vec) - 5.1MB (13.7MB for the pre-trained)
  - `Player2Vec.wordvectors` (Doc2Vec wordvectors object) - 4.7MB
  - `Player2Vec_embeddings.pickle` - 1MB
  - `Player2Vec_corpus.pickle` - 8.4MB (13.6MB for the pre-trained)
  - Plotly HTML file `Player2Vec_umap_plot.html` - ~5MB
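If you want to inspect these artifacts programmatically, a sketch like the one below can help; the exact structure of the pickled objects is package-specific, so check what was saved before relying on a particular layout.

```python
import pickle

with open("artifacts/models_artifacts/Player2Vec_embeddings.pickle", "rb") as f:
    embeddings = pickle.load(f)

# Inspect the saved object before assuming a particular structure
print(type(embeddings))
if isinstance(embeddings, dict):
    print(len(embeddings), list(embeddings)[:5])
```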
These include the following files:
- `explain.py`:
  - `Player2Vec Variance_umap_plot.html` (in the `MODELS_ARTIFACTS` directory).
  - `PlayersAnalogies` object outputs players analogies results (if the `export_artifacts` argument is sent as `True`). It produces a CSV file for each analogy.
    Naming format: `Analogy/<analogy name>/ <A1> - <A2> + <B2> ~ ?.csv`
  - `PlayerSkillsExplainer` outputs a CSV with the most similar results for each given query.
    Naming format: `most_similar_<player_name>_<variation_action>_<skill_name>.csv`
  - A Plotly UMAP projection figure will be opened via the browser for each given query.
- `skill_analysis.py`: no artifacts. Plotly figures are opened in the browser.
- `players_metrics_by_seasons.csv`: a DataFrame of the `players_metrics_df` metrics, also aggregated by season, for the evolution plots.
- `team_2_players.pickle`: a dict for fast access to all players of each team.
In order to allow export to Plotly studio, please fill `PLOTLY_USERNAME` and `PLOTLY_API_KEY` in `params.py`.
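For example (placeholder values only):

```python
# params.py — fill in your own Plotly studio credentials
PLOTLY_USERNAME = "<your-plotly-username>"
PLOTLY_API_KEY = "<your-plotly-api-key>"
```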