Data-Engineering

This repo hosts all tasks and projects related to Data Engineering.

Project 1

Data Warehousing Pipeline - Cloud (AWS Redshift)

Project Description

The main objective of this project was to build a pipeline that retrieves data from a data lake and creates tables so the data can be loaded into a data warehouse. Amazon Web Services (AWS) was the cloud service provider used for this project.

Datasets

The dataset used in this project consists of six (6) text files of pipe-delimited values. The data describes ticket sales and their associated attribute information (a quick preview of one file is sketched after the sample below). The six files are:

  1. allevents_pipe.txt
  2. allusers_pipe.txt
  3. category_pipe.txt
  4. date2008_pipe.txt
  5. listings_pipe.txt
  6. venue_pipe.txt

A sample of the dataset can be seen below:

Sample Dataset
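
As a quick local sanity check before loading, one of the pipe-delimited files can be previewed with pandas. This is only a minimal sketch; the file path and the assumption that the files carry no header row are illustrative.

import pandas as pd

# Preview a few rows of one of the pipe-delimited files.
# Assumes the file sits in the working directory and has no header row.
events = pd.read_csv("allevents_pipe.txt", sep="|", header=None)
print(events.head())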

Files

  1. The notebook file has the code for an end-to-end data pipeline on the cloud.

  2. The cluster.config file is used by the notebook to supply configuration values for the pipeline; Python's configparser module handles the parsing (a minimal sketch follows this list).
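
Below is a minimal sketch of how the notebook can read cluster.config with configparser. The section and key names are illustrative assumptions, not necessarily the ones used in the notebook.

import configparser

# Read the config file that holds the pipeline's settings.
config = configparser.ConfigParser()
config.read("cluster.config")

# Hypothetical section/key names; the real cluster.config may differ.
aws_key = config.get("AWS", "KEY")
aws_secret = config.get("AWS", "SECRET")
cluster_id = config.get("CLUSTER", "IDENTIFIER")

print(cluster_id)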

The architecture used includes:

  1. AWS S3 for storage
  2. AWS Redshift for data warehousing (a sketch of the S3-to-Redshift load is shown below)
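
The load step from S3 into Redshift is typically issued as a COPY command over a database connection. The sketch below is an assumption of how that might look with psycopg2; the host, credentials, bucket, table name, and IAM role are placeholders, not values from this project.

import psycopg2

# Placeholder connection details; real values would come from cluster.config.
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    dbname="dev",
    user="awsuser",
    password="password",
    port=5439,
)
cur = conn.cursor()

# Copy one pipe-delimited file from S3 into a table on the cluster.
cur.execute("""
    COPY allevents
    FROM 's3://my-ticket-bucket/allevents_pipe.txt'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    DELIMITER '|';
""")
conn.commit()
cur.close()
conn.close()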

Database Schema

The dataset came in a partially relational schema and was loaded into the data warehouse in that structure, since the main objective of this project was not star schema design. An ideal star schema for the dataset was nevertheless designed and broken down into fact and dimension tables. Below are visual guides to the dataset's schema in its current state and the ideal star schema for it.


Current Schema of Dataset:


Current Schema of Dataset

Ideal Star Schema Should Consist of:

  • 5 Dimension Tables
  • 1 Fact Table

Ideal Star Schema of Dataset

Project 2

Data Modelling Pipeline (PostgreSQL)

Project Description

The main objective of this project was to create a Postgres database with tables designed to optimize queries for song play analysis. This was to be done with an ETL pipeline. Fact and dimension tables for a star schema were needed for the analysis to be performed in the data warehouse.

Datasets

The datasets used in this project consist of two main sets of files, log data and song data, both in JSON format; a short sketch of reading them follows the example files below.

  1. Song Data

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:

song_data/A/B/C/TRABCEI128F424C983.json

song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", 
"artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, 
"year": 0}
  2. Log Data

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.

The log files in the dataset I was working with are partitioned by year and month. For example, here are filepaths to two files in this dataset:

log_data/2018/11/2018-11-12-events.json

log_data/2018/11/2018-11-13-events.json
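
A rough idea of how the pipeline can walk these partitioned directories and read the files is sketched below; the root paths and the use of pandas are assumptions rather than the exact approach taken in etl.py.

import glob
import os
import pandas as pd

# Collect every JSON file under the partitioned song_data and log_data roots.
song_files = glob.glob(os.path.join("song_data", "*", "*", "*", "*.json"))
log_files = glob.glob(os.path.join("log_data", "*", "*", "*.json"))

# Assumes each song file is a single-line JSON record and each log file
# has one JSON record per line, so lines=True handles both cases.
song_df = pd.read_json(song_files[0], lines=True)
log_df = pd.read_json(log_files[0], lines=True)

print(song_df[["song_id", "title", "artist_id", "year", "duration"]])
print(log_df.head())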

Files

There are three files in this project directory:

  1. Create_tables.py - Creates the tables in the Postgres warehouse
  2. etl.py - Runs the main pipeline of the project
  3. sql_queries.py - Holds the queries used in the project (create, insert, etc.); a sketch of how these pieces fit together follows the list
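
The sketch below illustrates, in simplified form, how the three scripts might fit together; the query, table definition, and connection string are placeholders rather than the project's exact code.

import psycopg2

# sql_queries.py would hold the DDL/DML strings, roughly like this:
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP,
        user_id     INT,
        song_id     VARCHAR,
        artist_id   VARCHAR
    );
"""

# create_tables.py would connect to the database and run each CREATE statement.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()
cur.execute(songplay_table_create)
conn.commit()

# etl.py would then read the JSON files and run the matching INSERT queries.
cur.close()
conn.close()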

Architecture used includes:

  1. PostgreSQL for Data Warehousing

Database Schema

A denormalized star schema was used for this project. This includes:

  1. One Fact Table
  2. Four Dimension Tables

Below is a visual guide to the star schema; an example analysis query over it is sketched after the image.


Star Schema of Dataset


Star Schema Image
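
As an illustration of the kind of song play analysis this schema is meant to serve, a query along the following lines could join the fact table with a dimension table; the table and column names here are hypothetical placeholders, not the project's actual schema.

import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Hypothetical analysis: the ten most played songs.
cur.execute("""
    SELECT s.title, COUNT(*) AS plays
    FROM songplays sp
    JOIN songs s ON sp.song_id = s.song_id
    GROUP BY s.title
    ORDER BY plays DESC
    LIMIT 10;
""")
for title, plays in cur.fetchall():
    print(title, plays)

cur.close()
conn.close()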

Project 3

ETL Pipeline (Postgres)

Project Description

The main objective of this project is to build a simple ETL pipeline that pulls COVID data from an API, transforms it, and loads it into a Postgres warehouse. Everything runs in Docker containers and is orchestrated with cron, which pulls the data from the API every day, applies the transformation, and loads it into the warehouse. A dashboard is then created from the data in Metabase. Four (4) container instances run in Docker to make this possible (the core pipeline steps are sketched after the list). These instances are:

  1. Warehouse (Postgres)
  2. PgAdmin (Database Management System/Tool)
  3. Pipelinerunner (Main etl pipeline script)
  4. Metabase (Dashboard)
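
The core steps of the pipeline script can be pictured roughly as below: pull from the API with requests, reshape with pandas, and append into the Postgres warehouse. The API URL, table name, and connection settings are placeholders, not the project's actual values; cron triggers a script like this once a day inside the pipelinerunner container.

import pandas as pd
import requests
from sqlalchemy import create_engine

# Extract: pull the latest covid data from the API (placeholder URL).
response = requests.get("https://example.com/api/covid/latest")
response.raise_for_status()
records = response.json()

# Transform: flatten the JSON payload into a dataframe.
df = pd.json_normalize(records)

# Load: append the rows into the warehouse table (placeholder credentials and table name).
engine = create_engine("postgresql://user:password@warehouse:5432/covid")
df.to_sql("covid_monitor", engine, if_exists="append", index=False)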

Dataset

The main data is pulled from an API. A snapshot of the data and its format can be seen below.

API Snapshot

Folders & Files

  1. Containers - This folder contains the pipelinerunner and warehouse subfolders, which hold the Dockerfile & requirements.txt and setup-covidM.sql respectively.

  2. scheduler - This folder contains the crontab file for running the main Python script.

  3. src/covidmonitor - The main Python script for the pipeline lives here.

  4. test - This folder contains a fixture folder, an integration folder, and a unit folder, which hold the files for integration and unit testing.

Architecture and Setup used include:

  1. PostgreSQL
  2. Docker & Docker Compose
  3. Git
  4. AWS (To be included soon)
  5. AWS CLI (To be included soon)

Running on a Local Machine

A personal database was used for this project, so configuration settings such as environment variables are not shared on GitHub. It is advisable to create a local database and apply your own configuration settings in the codebase in order to run it in Docker; the containers should then run just fine. If the data does not show up in pgAdmin, the wait-for command utility can be used to make the pipeline runner wait for the warehouse, since it depends on it.

There is a Makefile with common commands. These are executed in the running container.

cd covidmonitor
make up # starts all the containers
make ci # runs formatting, lint check, type check and python test

If the CI step passes you can go to http://localhost:3000 to check out your Metabase instance.

Tear down

The Docker containers can be torn down on your local machine with:

make down

Dashboard

Below is a snapshot of the dashboard that was created.

Dashboard