
Technical Assessment Solution

My solution submission for the technical assessment task.

Project Pipeline

[Figure: project pipeline diagram]

How to run

I designed the code to be dockerized so that all the scripts run in a single container that holds the data and drives the flow of the execution.

1. Running docker-compose

You need to run the docker-compose.yml file first to build the environment that holds the data. It contains:

  1. Jupyter Notebook (port: 8888)
  2. Postgres Database (port: 5432)
    • User: root
    • Password: root
    • DB: RetailDB
  3. Pg-Admin4 (port: 8080)

You can build the containers by typing the following in the main project directory, where docker-compose.yml is located:

docker-compose up -d 

Note: Docker should be installed.
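
Once the stack is up, a quick sanity check (assuming the default service definitions above) is to list the running containers and confirm the published ports, then open Jupyter Notebook on http://localhost:8888 and pgAdmin on http://localhost:8080:

docker-compose ps
docker ps --format "table {{.Names}}\t{{.Ports}}"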

2. Configure pg-admin4 to connect to RetailDB

[Screenshots: pgAdmin 4 connection configuration steps]
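
As a command-line alternative (assuming a psql client is installed on the host and port 5432 is published as described above), you can verify the database is reachable before configuring pgAdmin. Note that when registering the server inside pgAdmin, the host name should be the Postgres service name from docker-compose.yml rather than localhost, because pgAdmin runs inside the same Docker network:

psql -h localhost -p 5432 -U root -d RetailDB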

3. Running the scripts container

To be able to run the scripts, you need to build and run the Dockerfile. You can execute the following commands in the directory that contains the Dockerfile:

  1. Build the image: artefact-project:v01
    docker build -t artefact-project:v01 .
  2. Run the container (see the note below if the container name is already in use)
    docker run -it --network=global-network --name artefact_project_container artefact-project:v01
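
A standard Docker caveat: docker run --name fails if a container with that name already exists from a previous run. If that happens, remove the old container first and then re-run the command above:

docker rm -f artefact_project_container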

4. Scheduled Run

To run the scripts periodically, we need to set up a scheduled run. In our case I've built a cron job that runs the code every day at 10:00 AM. In your terminal, run the following commands:

  1. Open crontab
    crontab -e
  2. Put the schedule at the bottom of the opened file, then save and close it (see the note below about cron's working directory)
    0 10 * * * docker build -t artefact-project:v01 . && docker run --rm --network=global-network --name artefact_project_container artefact-project:v01
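
Keep in mind that cron runs jobs from the user's home directory and without a TTY, so the relative build context will not point at the project directory unless the job changes into it first, and interactive flags such as -it would fail. A more robust entry (with /path/to/technical-assessment as a hypothetical placeholder for wherever you cloned the repository, and output appended to a log file for easier debugging) might look like:

0 10 * * * cd /path/to/technical-assessment && docker build -t artefact-project:v01 . && docker run --rm --network=global-network --name artefact_project_container artefact-project:v01 >> /tmp/artefact_project_cron.log 2>&1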

You can watch your schedule's log by typing the following command:

grep CRON /var/log/syslog

Data Warehouse Schema

[Figure: data warehouse schema diagram]

Partitioning and Indexing

You will find the partitioning and indexing strategy in the partitioning-indexing directory; its README and the per-table SQL scripts document the approach.

Quality and Version Management

You will find the quality and version management strategies in the quality-versioning-management directory and its README.

Project Investigation and Deep Dive

You can find my sandbox Jupyter notebooks in the jupyter-data directory; they contain the investigations and experiments I worked through before building the final scripts.

Project Structure

├── Dockerfile
├── LICENSE
├── README.md
├── build_populate_dwh.py
├── clean_data.py
├── crontab.sh
├── docker-compose.yml
├── dwh-design
├── etl-data
├── etl_utils.py
├── extract_transform_load_data.py
├── images
├── ingest_base_data.py
├── jupyter-data
│ ├── data_cleaning_validation.ipynb
│ ├── data_ingestion.ipynb
│ ├── data_warehouse_build.ipynb
│ ├── etl_process.ipynb
│ └── online_retail.csv
├── partitioning-indexing
│ ├── DimCustomer
│ │ └── DimCustomer.sql
│ ├── DimDate
│ │ └── DimDate.sql
│ ├── DimProduct
│ │ └── DimProduct.sql
│ ├── FactRetailSales
│ │ └── FactRetailSales.sql
│ ├── online_retail_sales
│ │ └── online_retail_sales.sql
│ └── README.md
├── quality-versioning-management
│ ├── ALTER_online_retail_sales.sql
│ ├── LOGGING_online_retail_sales.sql
│ └── README.md
├── run_project.sh
└── utils.py

11 directories, 59 files