My solution submission for the technical assessment task.
I designed the code to be dockerized, so that all scripts run in a single container that holds the data and drives the flow of execution.
First you need to run the docker-compose.yml
file to build the environment that holds the data. It contains:
- Jupyter Notebook (port: 8888)
- Postgres Database (port: 5432)
  - User: root
  - Password: root
  - DB: RetailDB
- pgAdmin 4 (port: 8080)
  - Email: admin@admin.com
  - Password: root
You can build the containers by typing the following in the main project directory, where docker-compose.yml
is located
docker-compose up -d
Note: Docker and Docker Compose must be installed.
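Once the containers are up, you can optionally verify the environment. The checks below are only suggestions and assume the ports listed above are published on localhost and that the psql client is installed on your machine; Jupyter should then be reachable at http://localhost:8888 and pgAdmin at http://localhost:8080.
- Check that all services are running
docker-compose ps
- Connect to the database (password: root)
psql -h localhost -p 5432 -U root -d RetailDB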
To be able to run the scripts, you need to build and run the Dockerfile.
You can execute the following commands in the directory that contains the Dockerfile
- Build the image: artefact-project:v01
docker build -t artefact-project:v01 .
- Create and run the container
docker run -it --network=global-network --name artefact_project_container artefact-project:v01
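Note: docker run with a fixed --name will fail if a container with that name is left over from a previous run, and --network=global-network assumes that network already exists. If needed, the following commands (suggestions on my side, not part of the original flow) handle both cases before re-running:
- Create the network if docker-compose.yml does not create it
docker network create global-network
- Remove a leftover container before starting a new one
docker rm -f artefact_project_container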
To run the scripts periodically, we need to schedule the run. In this case I've built a cron job that runs the code every day at 10:00 AM. In your terminal, run the following commands
- Open crontab
crontab -e
- Add the schedule at the bottom of the opened file, then save and close it
0 10 * * * docker build -t artefact-project:v01 . && docker run -it --network=global-network --name artefact_project_container artefact-project:v01
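Note: cron jobs run without a terminal attached, so the -it flags above may make docker run fail with "the input device is not a TTY", and cron does not start in the project directory. A variant I would suggest (an adaptation on my side, not the original schedule; /path/to/project stands for the directory that contains the Dockerfile) changes into the project directory, drops -it, uses --rm so the container name is freed after each run, and appends the container output to a log file:
0 10 * * * cd /path/to/project && docker build -t artefact-project:v01 . && docker run --rm --network=global-network --name artefact_project_container artefact-project:v01 >> /tmp/artefact_project_cron.log 2>&1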
You can watch your schedule's log by typing the following command
grep CRON /var/log/syslog
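If your system does not write to /var/log/syslog (for example, distributions that only use the systemd journal), the equivalent check is the following; the unit name may be cron or crond depending on the distribution:
journalctl -u cron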
You will find the partitioning and indexing strategy in the partitioning-indexing directory: it contains one SQL script per table plus a README.md describing the approach. The scripts can be applied to the database as shown below.
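For example, the scripts can be run against RetailDB through the Postgres container. This is only a sketch: <postgres_container_name> is a placeholder for whatever name docker-compose gives the Postgres container (docker ps shows it), and psql may prompt for the password root.
docker exec -i <postgres_container_name> psql -U root -d RetailDB < partitioning-indexing/FactRetailSales/FactRetailSales.sql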
You will find the data quality and version management strategies in the quality-versioning-management directory, which contains the corresponding SQL scripts and a README.md. They can be applied the same way, as shown below.
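Under the same assumption about the container name as above:
docker exec -i <postgres_container_name> psql -U root -d RetailDB < quality-versioning-management/ALTER_online_retail_sales.sql
docker exec -i <postgres_container_name> psql -U root -d RetailDB < quality-versioning-management/LOGGING_online_retail_sales.sql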
You can find my sandbox Jupyter notebooks in the jupyter-data
directory; they contain the investigations and the experiments I ran before building the final scripts.
├── Dockerfile
├── LICENSE
├── README.md
├── build_populate_dwh.py
├── clean_data.py
├── crontab.sh
├── docker-compose.yml
├── dwh-design
├── etl-data
├── etl_utils.py
├── extract_transform_load_data.py
├── images
├── ingest_base_data.py
├── jupyter-data
│ ├── data_cleaning_validation.ipynb
│ ├── data_ingestion.ipynb
│ ├── data_warehouse_build.ipynb
│ ├── etl_process.ipynb
│ └── online_retail.csv
├── partitioning-indexing
│ ├── DimCustomer
│ │ └── DimCustomer.sql
│ ├── DimDate
│ │ └── DimDate.sql
│ ├── DimProduct
│ │ └── DimProduct.sql
│ ├── FactRetailSales
│ │ └── FactRetailSales.sql
│ ├── online_retail_sales
│ │ └── online_retail_sales.sql
│ └── README.md
├── quality-versioning-management
│ ├── ALTER_online_retail_sales.sql
│ ├── LOGGING_online_retail_sales.sql
│ └── README.md
├── run_project.sh
└── utils.py
11 directories, 59 files