In a nutshell, this project aims for three things:
- Acquiring data from the transfermarkt website using the transfermarkt-scraper.
- Building a clean, public football (soccer) dataset using the data acquired in 1.
- Automating 1 and 2 to keep these assets up to date and publicly available on some well-known data catalogs.
🔈 New! → This sample notebook demonstrates how to interact with the dataset using natural language, leveraging the OpenAI APIs.
```mermaid
classDiagram
    direction LR
    competitions --|> games : competition_id
    competitions --|> clubs : domestic_competition_id
    clubs --|> players : current_club_id
    clubs --|> club_games : opponent/club_id
    clubs --|> game_events : club_id
    players --|> appearances : player_id
    players --|> game_events : player_id
    players --|> player_valuations : player_id
    games --|> appearances : game_id
    games --|> game_events : game_id
    games --|> clubs : home/away_club_id
    games --|> club_games : game_id

    class competitions {
        competition_id
    }
    class games {
        game_id
        home/away_club_id
        competition_id
    }
    class game_events {
        game_id
        player_id
    }
    class clubs {
        club_id
        domestic_competition_id
    }
    class club_games {
        club_id
        opponent_club_id
        game_id
    }
    class players {
        player_id
        current_club_id
    }
    class player_valuations {
        player_id
    }
    class appearances {
        appearance_id
        player_id
        game_id
    }
```
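To make the diagram concrete, here is a minimal pandas sketch that joins a few assets on the keys shown above. It assumes the prepared assets are available as CSV files under `data/prep` with column names matching the diagram; adjust the paths to your setup.

```python
# a minimal sketch of the relationships in the diagram, assuming prepared CSVs
# under data/prep whose columns match the keys shown above
import pandas as pd

games = pd.read_csv("data/prep/games.csv")
clubs = pd.read_csv("data/prep/clubs.csv")
competitions = pd.read_csv("data/prep/competitions.csv")

# games -> competitions via competition_id
games_enriched = games.merge(
    competitions, on="competition_id", how="left", suffixes=("", "_competition")
)

# games -> clubs via the home club (home/away_club_id in the diagram)
games_enriched = games_enriched.merge(
    clubs, left_on="home_club_id", right_on="club_id", how="left", suffixes=("", "_home_club")
)

print(games_enriched.head())
```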
Set up your local environment to run the project with poetry.
- Install poetry
- Install python dependencies (poetry will create a virtual environment for you)

```
cd transfermarkt-datasets
poetry install
```
The Makefile in the root defines a set of useful targets that will help you run the different parts of the project. Some examples are:

| target | description |
|---|---|
| `dvc_pull` | pull data from the cloud (AWS S3) |
| `docker_build` | build the project docker image and tag accordingly |
| `acquire_local` | run the acquiring process locally (refreshes `data/raw`) |
| `prepare_local` | run the prep process locally (refreshes `data/prep`) |
| `sync` | run the sync process (refreshes data frontends) |
| `streamlit_local` | run the streamlit app locally |
| `dagit_local` | run dagit locally |

Run `make help` to see the full list. Once you've completed the setup, you should be able to run most of these from your machine.
All project data assets are kept inside the `data` folder. This is a DVC repository, so all files can be pulled from remote storage with the `make dvc_pull` target.

| path | description |
|---|---|
| `data/raw` | contains raw data per season as acquired with transfermarkt-scraper (check acquire) |
| `data/prep` | contains prepared datasets as produced by dbt (check prepare) |

⚠️ Read access to the project's S3 DVC remote storage is required to successfully run `dvc pull`. Contributors can grant themselves access by adding their AWS IAM user ARN to this whitelist.
In the scope of this project, "acquiring" is the process of collecting "raw data" as it is produced by transfermarkt-scraper. Acquired data lives in the `data/raw` folder, and it can be created or updated for a particular season by running `make acquire_local`:

```
make acquire_local ARGS="--asset all --season 2022"
```

This runs the scraper with a set of parameters and collects the output in `data/raw`.
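If you want to peek at the raw files directly, a rough sketch along these lines can help. It assumes the scraper writes newline-delimited JSON files somewhere under `data/raw`; the glob pattern and the `games.json` file name are assumptions, so adjust them to the actual layout in your checkout.

```python
# rough sketch: inspect raw scraper output
# (assumes newline-delimited JSON under data/raw; the glob pattern and the
#  "games.json" file name are assumptions, adjust them to the actual layout)
from pathlib import Path

import pandas as pd

raw_files = sorted(Path("data/raw").glob("**/games.json"))
if raw_files:
    raw_games = pd.read_json(raw_files[-1], lines=True)
    print(raw_games.shape)
    print(list(raw_games.columns))
else:
    print("No raw games files found under data/raw")
```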
In the scope of this project, "preparing" is the process of transforming raw data to create a high-quality dataset that can be conveniently consumed by analysts of all kinds.
Data preparation is done in SQL using dbt and DuckDB. You can trigger a run of the preparation task using the `prepare_local` make target, or work with the dbt CLI directly if you prefer:
- `cd dbt` → the `dbt` folder contains the dbt project for data preparation
- `dbt deps` → install dbt packages; this is only required the first time you run dbt
- `dbt run -m +appearances` → refresh the appearances file by running the model in dbt
dbt runs will populate a `dbt/duck.db` file on your local machine, which you can connect to using the DuckDB CLI and query with SQL:

```
duckdb dbt/duck.db
```
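You can also query the same file from python with the duckdb package. A minimal sketch follows; the `appearances` table name is an assumption about how the dbt models are materialized, so use `SHOW TABLES` to find the actual names.

```python
# minimal sketch: query dbt's DuckDB output from python
# (assumes the dbt models are materialized as tables/views in dbt/duck.db;
#  the "appearances" table name is an assumption)
import duckdb

con = duckdb.connect("dbt/duck.db", read_only=True)
print(con.sql("SHOW TABLES").df())
print(con.sql("SELECT count(*) AS n FROM appearances").df())
con.close()
```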
A thin python wrapper is provided as a convenience utility to help with loading and inspecting the dataset (for example, from a notebook).
```python
# import the module
from transfermarkt_datasets.core.dataset import Dataset

# instantiate the datasets handler
td = Dataset()

# load all assets into memory as pandas dataframes
td.load_assets()

# inspect assets
td.asset_names  # ["games", "players", ...]
td.assets["games"].prep_df  # get the built asset in a dataframe

# get raw data in a dataframe
td.assets["games"].load_raw()
td.assets["games"].raw_df
```
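Building on the snippet above, here is a small hypothetical example of the kind of analysis the wrapper enables. The `goals` column on the prepared `appearances` asset is an assumption; check the asset's columns in your copy of the data.

```python
# hypothetical example: total goals per player from the prepared appearances asset
# (assumes the prepared appearances dataframe has player_id and goals columns)
from transfermarkt_datasets.core.dataset import Dataset

td = Dataset()
td.load_assets()

appearances = td.assets["appearances"].prep_df
top_scorers = (
    appearances.groupby("player_id")["goals"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
print(top_scorers)
```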
The module code lives in the `transfermarkt_datasets` folder with the structure below.

| path | description |
|---|---|
| `transfermarkt_datasets/core` | core classes and utils that are used to work with the dataset |
| `transfermarkt_datasets/tests` | unit tests for core classes |
| `transfermarkt_datasets/assets` | prepared asset definitions: one python file per asset |
For more examples on using `transfermarkt_datasets`, check out the sample notebooks.
Prepared data is published to a couple of popular dataset websites. This is done by running `make sync`, which runs weekly as part of the data pipeline.
There is a streamlit app for the project with documentation, a data catalog and sample analyses. The app is currently hosted on fly.io; you can check it out here.
For local development, you can also run the app on your machine. Provided you've completed the setup, run the following to spin up a local instance of the app:

```
make streamlit_local
```
⚠️ Note that the app expects prepared data to exist in `data/prep`. Check out data storage for instructions on how to populate that folder.
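If you want a quick check that the folder is populated before launching the app, a tiny sketch like this works; the listed file names are assumptions, so adjust them to the actual prepared assets.

```python
# quick sanity check that data/prep is populated before running the streamlit app
# (the listed file names are assumptions; adjust to the actual prepared assets)
from pathlib import Path

prep_dir = Path("data/prep")
expected = ["games.csv", "players.csv", "appearances.csv"]
missing = [name for name in expected if not (prep_dir / name).exists()]
if missing:
    print(f"Missing prepared assets: {missing}. Run `make dvc_pull` or `make prepare_local` first.")
else:
    print("data/prep looks populated, run `make streamlit_local`")
```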
Define all the necessary infrastructure for the project in the cloud with Terraform.
In order to keep things tidy, there are two simple guidelines:
- Keep the conversation centralised and public by getting in touch via the Discussions tab.
- Avoid topic duplication by having a quick look at the FAQs.
Contributions to `transfermarkt-datasets` are most welcome. If you want to contribute new fields or assets to this dataset, the instructions are quite simple:
- Fork the repo
- Set up your local environment
- Pull the raw data, either by running `dvc pull` (an access request is needed) or by using `make acquire_local` (no access request needed)
- Start modifying assets or creating new ones in the dbt project. You can use `make prepare_local` to run and test your changes.
- If it's all looking good, create a pull request with your changes 🚀
ℹ️ In case you face any issues following the instructions above, or if you have questions in general, you may get in touch.