Data-Challenge

This repo contains my solution to the Data Engineering Challenge organized by Nordeus.

You can download the dataset file events.jsonl from the link above.

Initial Setup
How To Run It
Data Processing Pipeline

Initial Setup

There are two options for setting up the work environment:

Using Anaconda
- Open Anaconda Prompt and navigate to the directory of this repo by using: cd PATH_TO_THIS_REPO
- Execute conda env create -f environment.yml. This sets up an environment with all necessary dependencies.
- Activate the newly created environment by executing: conda activate nordeus_data_challenge
Using system-wide Python
- Open Bash/Command Prompt/PowerShell and navigate to the directory of this repo by using: cd PATH_TO_THIS_REPO
- Run pip install -r requirements.txt

Your work environment should be properly set up now.

How To Run It

Navigate to the directory of this repo by using: cd PATH_TO_THIS_REPO
Run python process_data.py -d DATASET_PATH. This script loads the dataset and executes the data cleaning pipeline described in the Data Processing Pipeline section. Once the pipeline finishes, the club performance data is saved to a database.

usage: process_data.py [-h] -d DATASET_PATH

Loads and cleans the dataset. Saves processed data to a database.

optional arguments:
  -h, --help            show this help message and exit

Required Arguments:
  -d DATASET_PATH, --dataset_path DATASET_PATH
                        Path where the '.jsonl' dataset file is stored.
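
The usage text above can be reproduced with Python's argparse module. The following is a minimal sketch, not the actual contents of process_data.py: the build_parser function name is hypothetical, while the flag names, description, and help text are copied from the usage output.

```python
import argparse


def build_parser():
    """Build a parser mirroring the usage text above (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(
        prog="process_data.py",
        description="Loads and cleans the dataset. Saves processed data to a database.",
    )
    # A named argument group gives the flag its own heading in the --help output.
    required = parser.add_argument_group("Required Arguments")
    required.add_argument(
        "-d",
        "--dataset_path",
        required=True,
        help="Path where the '.jsonl' dataset file is stored.",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-d", "events.jsonl"])
print(args.dataset_path)
```

Marking the flag as required=True makes argparse exit with a usage error when -d is omitted, which matches the "-d DATASET_PATH" shown without brackets in the usage line.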

Run python main.py -l LEAGUE_ID. This script retrieves and displays the scoreboard of the requested league if a league with that id exists. Otherwise, an error message is displayed.

usage: main.py [-h] -l LEAGUE_ID

Retrieves scoreboard for the desired league

optional arguments:
  -h, --help            show this help message and exit

Required arguments:
  -l LEAGUE_ID, --league_id LEAGUE_ID
                        Id of a league for which we want to get the
                        scoreboard.
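
To illustrate the lookup behaviour described above (return the scoreboard, or report an error for an unknown league), here is a minimal sketch. This README does not say which database engine or schema the project uses, so SQLite, the club_performance table, its columns, and the get_scoreboard function are all assumptions made for illustration only.

```python
import sqlite3


def get_scoreboard(conn, league_id):
    """Return (club_id, points) rows for one league, highest points first.

    The table and column names are hypothetical, not the project's real schema.
    """
    rows = conn.execute(
        "SELECT club_id, points FROM club_performance "
        "WHERE league_id = ? ORDER BY points DESC, club_id ASC",
        (league_id,),
    ).fetchall()
    if not rows:
        # Mirrors the "error message" case for a league id that does not exist.
        raise LookupError(f"No league with id {league_id!r} was found.")
    return rows


# Tiny in-memory database with made-up rows, used only for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE club_performance (league_id TEXT, club_id TEXT, points INTEGER)"
)
conn.executemany(
    "INSERT INTO club_performance VALUES (?, ?, ?)",
    [("L1", "club_a", 3), ("L1", "club_b", 6), ("L2", "club_c", 1)],
)

for club_id, points in get_scoreboard(conn, "L1"):
    print(club_id, points)
```

Passing the league id as a query parameter (the ? placeholder) rather than formatting it into the SQL string avoids injection problems when the id comes straight from the command line.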

Data Processing Pipeline

After cleaning, the statistics related to club performance are stored in a database.

To retrieve the league scoreboard correctly, the dataset first had to be preprocessed.
The dataset cleaning steps are explained in the diagram below.
The functions that make up the cleaning/processing part of the API are available here.
The functions for database manipulation are available here.
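
As a rough idea of what preprocessing a '.jsonl' file can involve, the sketch below parses JSON Lines input and applies two generic cleaning steps: dropping malformed lines and dropping duplicate events. It is not the project's pipeline, whose actual steps are the ones in the diagram. The function names and the event_id and event_type fields are hypothetical.

```python
import json


def load_events(lines):
    """Parse JSON Lines input, skipping blank and malformed lines."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # drop lines that are not valid JSON
        if isinstance(event, dict):
            events.append(event)
    return events


def drop_duplicates(events, key="event_id"):
    """Keep the first event per value of `key` (a hypothetical field name)."""
    seen = set()
    unique = []
    for event in events:
        value = event.get(key)
        if value is None or value in seen:
            continue  # drop events without a key or with an already seen key
        seen.add(value)
        unique.append(event)
    return unique


# Made-up sample lines; a real run would iterate over the opened dataset file.
sample = [
    '{"event_id": 1, "event_type": "registration"}',
    "not json",
    "",
    '{"event_id": 1, "event_type": "registration"}',
    '{"event_id": 2, "event_type": "match"}',
]
cleaned = drop_duplicates(load_events(sample))
print(len(cleaned))
```

Because load_events accepts any iterable of strings, the same function works on an open file object, which reads the dataset line by line instead of loading it into memory at once.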
