Extract, prepare, publish and update Transfermarkt datasets.

transfermarkt-datasets

Use data from transfermarkt-scraper to build a clean, public football (soccer) dataset. It includes data such as clubs, games, players and player appearances from a number of national and international competitions and seasons.

Automate the data pipeline to keep these assets up to date and publicly available on well-known data catalogs for the data community's convenience.

Kaggle | data.world

diagram

All project data assets are kept inside the data folder. This is a DVC repository, so all files for the current revision can be pulled from remote storage with the dvc pull command.

ℹ️ Read access to the DVC remote storage for this project is required to successfully run dvc pull. Contributors should feel free to grant themselves access by adding their AWS IAM user ARN to this whitelist. Have a look at this PR for an example.

Raw data within this folder can be updated by running transfermarkt-scraper via the 1_acquire.py script.

$ python 1_acquire.py --asset all --season 2021
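To refresh more than one season, the invocation above can be repeated per season. The following is a minimal sketch, not part of the project: `acquire_commands` and `run_all` are hypothetical helpers, and only the `--asset` and `--season` flags shown above are assumed to exist.

```python
import subprocess
from typing import Iterable, List


def acquire_commands(seasons: Iterable[int], asset: str = "all") -> List[List[str]]:
    """Build one 1_acquire.py invocation per season (hypothetical helper)."""
    return [
        ["python", "1_acquire.py", "--asset", asset, "--season", str(season)]
        for season in seasons
    ]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the commands; call run_all(...) from the repository
    # root to actually execute them.
    for command in acquire_commands([2020, 2021]):
        print(" ".join(command))
```

Building the argument lists separately from running them makes it easy to inspect what would be executed before triggering a scrape.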

The preparation scripts transform scraped raw data into a cleaned, validated data package that can be used as the basis for further analysis in this project. Run 2_prepare.py to produce the prepared dataset within data/prep.

$ python 2_prepare.py [--raw-files-location data/raw]

For reference on the types of assets produced by this script, check out the published datasets linked above.

The preparation step uses raw data as input, so raw files need to be available locally in order to run it. You may pull raw assets by running dvc pull as mentioned earlier, or acquire new and updated raw assets via 1_acquire.py.
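Because the preparation step fails without local raw files, a small pre-flight check can help. This is a rough sketch under stated assumptions: `raw_files_available` and `prepare_command` are hypothetical helpers, the default path mirrors the data/raw location shown above, and the check is deliberately coarse (it only looks for at least one file that is not a `.dvc` pointer or a `.gitignore`).

```python
from pathlib import Path
from typing import List


def raw_files_available(raw_dir: str = "data/raw") -> bool:
    """Return True if raw_dir exists and holds at least one data file.

    DVC pointer files (*.dvc) and .gitignore files are ignored, since
    they can be present before any data has been pulled.
    """
    root = Path(raw_dir)
    return root.is_dir() and any(
        p.is_file() and p.suffix != ".dvc" and p.name != ".gitignore"
        for p in root.rglob("*")
    )


def prepare_command(raw_dir: str = "data/raw") -> List[str]:
    """Build the 2_prepare.py invocation for the given raw location."""
    return ["python", "2_prepare.py", "--raw-files-location", raw_dir]


if __name__ == "__main__":
    if raw_files_available():
        print(" ".join(prepare_command()))
    else:
        print("No raw files found; run dvc pull or 1_acquire.py first.")
```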

Define all the necessary infrastructure for the project in the cloud with Terraform.