Building a Data Warehouse on AWS

The scope of the project is to create a Data Warehouse for a music application. The source data are stored in a bunch of JSON files in AWS S3 buckets. We process them and extract useful information about users, songs, artists, users' listing information. Next, push extracted data to fact and dimensional tables for further analysis.

How to Deploy and Run

We have used virtualenv and pip for setting up our Python environment. Setting up the Python environment is fairly easy with virtualenv. You just need to run the following commands to set up the environment.

Create a new virtual environment: virtualenv -p python3.8 venv
Activate it: source ./venv/bin/activate
Install dependencites: pip install -r requirements.txt
Create the Redshift cluster and copy its host name, database name, username, password, and aim role.
Update your dwh.cfg file.

Next, execute two following commands to create your database schema and insert data into the fact and dimensional tables.

To create database schema: python create_tables.py
To load the data into tables: python etl.py

ETL Pipeline

We employed a two-stage approach to building our extract, transform, load (ETL) pipeline:

First, we load data (in JSON format ) from S3 buckets to two temporary (Redshift) tables. We used the Redshift COPY command to move data from S2 to Redshift tables.
For moving data from Redshift temporary tables to Fact and Dimensionsional tables, we use INSERT SQL commend.

Database Structure

Figure 1 shows the start scheme use used for the project. In our database scheme, we have one fact table (songplays) and four dimensional tables (users, songs, artists, time).

We are going to run a lot of analytical queries against our data stored in the database. We have selected star schema because it is the most appropriate database schema for our analytical queries.

Figure 1: The Database Schema

upul/Data-Warehouse-On-AWS

Building a Data Warehouse on AWS

How to Deploy and Run

ETL Pipeline

Database Structure