This repo contains my solution to the Data Engineering Challenge organized by Nordeus.
You can download the dataset file events.jsonl from the link above.
There are two options for setting up the work environment:
- Using Anaconda
- Open Anaconda Prompt and navigate to the directory of this repo by using:
cd PATH_TO_THIS_REPO
- Execute
conda env create -f environment.yml
This will set up an environment with all necessary dependencies.
- Activate the previously created environment by executing:
conda activate nordeus_data_challenge
- Using system-wide Python
- Open Bash/Command Prompt/PowerShell and navigate to the directory of this repo by using:
cd PATH_TO_THIS_REPO
- Run
pip install -r requirements.txt
Your work environment should be properly set up now.
- Navigate to the directory of this repo by using:
cd PATH_TO_THIS_REPO
- Run
python process_data.py -d DATASET_PATH
This script loads the dataset and executes the data cleaning pipeline explained here. After going through the pipeline, the club performance data is saved to a database.
usage: process_data.py [-h] -d DATASET_PATH
Loads and cleans the dataset. Saves processed data to a database.
optional arguments:
-h, --help show this help message and exit
Required Arguments:
-d DATASET_PATH, --dataset_path DATASET_PATH
Path where the '.jsonl' dataset file is stored.
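For readers curious how a command-line interface like the one above could be wired up, here is a minimal sketch. It is not the actual `process_data.py`: the function names `build_parser` and `load_events` are hypothetical, and the cleaning pipeline itself is not shown. It only assumes that the dataset is a JSON Lines file, with one JSON object per line.

```python
import argparse
import json


def build_parser():
    # Mirrors the usage text above; the real script may be organized differently.
    parser = argparse.ArgumentParser(
        prog="process_data.py",
        description="Loads and cleans the dataset. Saves processed data to a database.",
    )
    required = parser.add_argument_group("Required Arguments")
    required.add_argument(
        "-d",
        "--dataset_path",
        required=True,
        help="Path where the '.jsonl' dataset file is stored.",
    )
    return parser


def load_events(path):
    # JSON Lines: parse every non-empty line as a separate JSON object.
    events = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                events.append(json.loads(line))
    return events


def main(argv=None):
    # argv=None makes argparse read the arguments from sys.argv.
    args = build_parser().parse_args(argv)
    events = load_events(args.dataset_path)
    print(f"Loaded {len(events)} events")
```

Calling `main(["-d", "events.jsonl"])` would parse the path and load the file, after which the cleaning steps would run on `events`.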
- Run
python main.py -l LEAGUE_ID
This script retrieves and displays the scoreboard of the requested league if a league with that ID exists. Otherwise, an error message is displayed.
usage: main.py [-h] -l LEAGUE_ID
Retrieves scoreboard for the desired league
optional arguments:
-h, --help show this help message and exit
Required arguments:
-l LEAGUE_ID, --league_id LEAGUE_ID
Id of a league for which we want to get the
scoreboard.
After cleaning, the statistics related to club performance are stored in a database.
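The store-then-query flow could look roughly like the sketch below. This README does not state which database engine, table layout, or ranking criteria the project uses, so everything here is an assumption made for illustration: SQLite via the standard `sqlite3` module, a hypothetical `club_stats` table, a hypothetical `points` column, and made-up sample rows.

```python
import sqlite3


def save_club_stats(conn, rows):
    # rows: iterable of (league_id, club_name, points) tuples (hypothetical schema).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS club_stats ("
        "league_id TEXT NOT NULL, club_name TEXT NOT NULL, points INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO club_stats (league_id, club_name, points) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def get_scoreboard(conn, league_id):
    # Parameterized query; clubs are ordered by points, highest first.
    rows = conn.execute(
        "SELECT club_name, points FROM club_stats "
        "WHERE league_id = ? ORDER BY points DESC, club_name ASC",
        (league_id,),
    ).fetchall()
    if not rows:
        # Corresponds to the error shown when the league does not exist.
        raise LookupError(f"No league with id {league_id!r} found.")
    return rows


# Made-up sample data in an in-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
save_club_stats(conn, [("L1", "Club A", 7), ("L1", "Club B", 10), ("L2", "Club C", 3)])
for position, (club, points) in enumerate(get_scoreboard(conn, "L1"), start=1):
    print(position, club, points)
```

Raising an exception for an unknown league keeps the lookup function free of printing, so a script like `main.py` can catch it and show an error message of its choosing.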