A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory with JSON metadata on the songs in the app.
As their data engineer, you are tasked with building an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs their users are listening to. You will be able to test your database and ETL pipeline by running queries given to you by the Sparkify analytics team and comparing your results with their expected results.
This tree shows the repository structure. Only the project's main files are described.
.
├── images
├── src
│ ├── config.py # Application config manager
│ ├── create_tables.py # Database initialization script
│ ├── etl.py # ETL pipeline script
│ ├── sparkify_stack_create.py # Script for the Sparkify stack creation
│ ├── sparkify_stack_delete.py # Script for the Sparkify stack deletion
│ ├── sparkify_stack.json # CloudFormation template of the Sparkify stack
│ ├── sparkify.cfg # Application config file
│ ├── sql_queries.py # Database queries
├── .editorconfig
├── .gitignore
├── README.md
└── requirements.txt
In addition to the Udacity nanodegree materials, I found these links about how to model data in AWS Redshift really interesting:
- AWS Redshift Reference: CREATE TABLE
- AWS Redshift Reference: Distribution Styles
- AWS Big Data Blog: Distribution Styles and Distribution Keys
staging_events
This table has the field `artist` as the distribution key, and the field `song` as the sort key. Both fields will be used later to JOIN other tables to prepare the dimension and fact data.
Field | Type | PK | DISTKEY | SORTKEY | FK |
---|---|---|---|---|---|
eventId | INTEGER | X | | | |
artist | VARCHAR | | X | | |
auth | VARCHAR | | | | |
firstName | VARCHAR | | | | |
gender | VARCHAR(1) | | | | |
itemInSession | INTEGER | | | | |
lastName | VARCHAR | | | | |
length | FLOAT | | | | |
level | VARCHAR | | | | |
location | VARCHAR | | | | |
method | VARCHAR(7) | | | | |
page | VARCHAR | | | | |
registration | BIGINT | | | | |
sessionId | INTEGER | | | | |
song | VARCHAR | | | X | |
status | SMALLINT | | | | |
ts | TIMESTAMP | | | | |
userAgent | VARCHAR | | | | |
userId | INTEGER | | | | |
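To give an idea of how this design translates into DDL, here is a minimal sketch of what the CREATE TABLE statement could look like. The actual statement lives in `src/sql_queries.py`; the types follow the table above, and treating `eventId` as an auto-generated surrogate key is an assumption.

```sql
-- Sketch only: the real DDL is defined in src/sql_queries.py.
CREATE TABLE IF NOT EXISTS staging_events (
    eventId        INTEGER IDENTITY(0, 1),  -- assumed surrogate key
    artist         VARCHAR,
    auth           VARCHAR,
    firstName      VARCHAR,
    gender         VARCHAR(1),
    itemInSession  INTEGER,
    lastName       VARCHAR,
    length         FLOAT,
    level          VARCHAR,
    location       VARCHAR,
    method         VARCHAR(7),
    page           VARCHAR,
    registration   BIGINT,
    sessionId      INTEGER,
    song           VARCHAR,
    status         SMALLINT,
    ts             TIMESTAMP,
    userAgent      VARCHAR,
    userId         INTEGER
)
DISTKEY (artist)   -- rows are distributed by artist (DISTSTYLE KEY)
SORTKEY (song);    -- and sorted by song to speed up the later JOINs
```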
staging_songs
The same applies to this table: the field `artist_id` is used as the distribution key, and the field `song_id` as the sort key. The distribution style is KEY.
Field | Type | PK | DISTKEY | SORTKEY | FK |
---|---|---|---|---|---|
song_id | VARCHAR | X | | X | |
num_songs | INTEGER | | | | |
title | VARCHAR(1024) | | | | |
artist_name | VARCHAR(1024) | | | | |
artist_latitude | FLOAT | | | | |
year | SMALLINT | | | | |
duration | FLOAT | | | | |
artist_id | VARCHAR | | X | | |
artist_longitude | FLOAT | | | | |
artist_location | VARCHAR(1024) | | | | |
time
This is a small table that can be replicated to every node, so I set the distribution style to ALL.
Field | Type | PK | DISTKEY | SORTKEY | FK |
---|---|---|---|---|---|
start_time | TIMESTAMP | X | | | |
hour | SMALLINT | | | | |
day | SMALLINT | | | | |
week | SMALLINT | | | | |
month | SMALLINT | | | | |
year | SMALLINT | | | | |
weekday | SMALLINT | | | | |
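As a reference, a DDL sketch for a dimension using the ALL distribution style could look like the block below; the real statement is in `src/sql_queries.py`.

```sql
-- Sketch only: a small dimension replicated to every node via DISTSTYLE ALL.
-- PRIMARY KEY is informational in Redshift (not enforced), but the planner
-- can still use it.
CREATE TABLE IF NOT EXISTS time (
    start_time  TIMESTAMP  NOT NULL PRIMARY KEY,
    hour        SMALLINT,
    day         SMALLINT,
    week        SMALLINT,
    month       SMALLINT,
    year        SMALLINT,
    weekday     SMALLINT
)
DISTSTYLE ALL;
```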
users
This table doesn't participate in any JOIN, so it has no distribution or sort keys. It's tiny, so the distribution style can be ALL.
Field | Type | PK | DISTKEY | SORTKEY | FK |
---|---|---|---|---|---|
user_id | INTEGER | X | | | |
first_name | VARCHAR | | | | |
last_name | VARCHAR | | | | |
gender | VARCHAR | | | | |
level | VARCHAR | | | | |
artists
The size of this table is reasonably small, so I decided to set its distribution style to ALL and make it available on every node to ease the JOIN clauses.
Field | Type | PK | DISTKEY | SORTKEY | FK |
---|---|---|---|---|---|
artist_id | VARCHAR | X | | | |
name | VARCHAR(1024) | | | | |
location | VARCHAR(1024) | | | | |
latitude | FLOAT | | | | |
longitude | FLOAT | | | | |
songs
Artists and songs are related to each other, so their identifiers are always good candidates to be involved in the data distribution. I set `artist_id` as both the distribution and the sort key; therefore, the distribution style is KEY.
Field | Type | PK | DISTKEY | SORTKEY | FK |
---|---|---|---|---|---|
song_id | VARCHAR | X | | | |
title | VARCHAR(1024) | | | | |
artist_id | VARCHAR | | X | X | artists(artist_id) |
year | SMALLINT | | | | |
duration | FLOAT | | | | |
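A possible DDL for this dimension, sketched under the same caveat that the real statement lives in `src/sql_queries.py`:

```sql
-- Sketch only: KEY distribution and sorting on artist_id.
-- The foreign key is informational in Redshift (not enforced).
CREATE TABLE IF NOT EXISTS songs (
    song_id    VARCHAR        NOT NULL PRIMARY KEY,
    title      VARCHAR(1024),
    artist_id  VARCHAR        NOT NULL REFERENCES artists (artist_id),
    year       SMALLINT,
    duration   FLOAT
)
DISTKEY (artist_id)
SORTKEY (artist_id);
```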
songplays
As noted above, I consider the artist-song relationship important enough for their identifiers to participate in the data distribution: the field `artist_id` is the distribution key, while the field `song_id` is the sort key. Therefore, the distribution style is KEY.
Field | Type | PK | DISTKEY | SORTKEY | FK |
---|---|---|---|---|---|
songplay_id | INTEGER | X | | | |
start_time | TIMESTAMP | | | | time(start_time) |
user_id | INTEGER | | | | users(user_id) |
level | VARCHAR | | | | |
song_id | VARCHAR | | | X | songs(song_id) |
artist_id | VARCHAR | | X | | artists(artist_id) |
session_id | INTEGER | | | | |
location | VARCHAR | | | | |
user_agent | VARCHAR | | | | |
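And a sketch of the fact table DDL, assuming `songplay_id` is an auto-generated surrogate key (again, the actual statement is defined in `src/sql_queries.py`):

```sql
-- Sketch only: the fact table is distributed on artist_id (to co-locate
-- with songs) and sorted on song_id. Constraints are informational.
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id  INTEGER    IDENTITY(0, 1) PRIMARY KEY,  -- assumed surrogate key
    start_time   TIMESTAMP  REFERENCES time (start_time),
    user_id      INTEGER    REFERENCES users (user_id),
    level        VARCHAR,
    song_id      VARCHAR    REFERENCES songs (song_id),
    artist_id    VARCHAR    REFERENCES artists (artist_id),
    session_id   INTEGER,
    location     VARCHAR,
    user_agent   VARCHAR
)
DISTKEY (artist_id)
SORTKEY (song_id);
```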
It is assumed that the tools below are properly installed locally:
- Python is a programming language that lets you work quickly and integrate systems more effectively. Version 3.x is required.
- pip is the package installer for Python.
- Virtualenv is a tool to create isolated Python environments.
- Amazon Web Services is the world's most comprehensive and broadly adopted cloud platform, offering over 165 fully featured services from data centers globally.
Before running the ETL, we must set up a few things in our AWS account.
NOTE: this step is necessary only if you don't have a user with privileges to manage AWS resources from the SDK.
Open the AWS console, log into your account and go to IAM. Choose Users from the left-side menu, and then click the Add user button.
Fill in the User name box. Also mark the Programmatic access checkbox.
Now we must attach the `AdministratorAccess` policy to the user. Just select the option Attach existing policies directly, and then mark the policy's checkbox in the policies list.
Lastly, once the user is created, download its credentials by clicking the Download .csv button. You can also copy the Access key ID and the Secret access key directly from the site and put them in a safe place. Important: never share these credentials with anyone, or your AWS account can be compromised.
Keep the user credentials on hand; we will use them right now. NOTE: you can use the credentials of any user with the proper permissions.
Go to the file src/sparkify.cfg
and paste the credentials in the following keys:
...
ACCESS_KEY_ID = the-access-key-id
SECRET_ACCESS_KEY = the-secret-access-key
...
Our Redshift cluster needs permission to read an S3 bucket provisioned in US West (Oregon). This is the region `us-west-2`, where the Udacity team stores the source data that we will be processing later. This can be achieved by letting the Redshift cluster assume a role with the `AmazonS3ReadOnlyAccess` policy. Internally, this operation uses AWS Security Token Service (STS) to generate temporary credentials for the communication with S3. To make it work properly, we must enable STS in that region.
Go to IAM and choose Account settings from the left side menu. Down in the section Security Token Service (STS), click the link Activate next to the region US West (Oregon).
The first step is to clone this repository. Just type the following command in your Terminal:
# Clone the repository...
git clone https://github.com/vermicida/data-warehouse.git
# ...and move to its directory
cd data-warehouse
Once the project is cloned, you must create a new virtual environment for it. You may have multiple Python versions installed, so it's important to specify the correct one with the `--python` argument. The code in this repository has been successfully tested on Python 3.6, but it should work on any Python 3 version; set the version that applies to your case in the following commands:
# Create the virtual environment...
virtualenv --python python3.6 project-env
# ...and activate it
source project-env/bin/activate
It's time to install the project's dependencies. You will use pip
to do this job:
pip install -r requirements.txt
It will be ready in a minute.
These are the steps to follow in order to make the ETL pipeline work.
The Sparkify stack will create the following resources:
- An AWS Redshift cluster that we will use as the data warehouse
- An IAM role that grants the cluster read access to S3
Go back to your Terminal and run the following script:
# Move to the directory src...
cd src
# ...and run the script to create the stack
python sparkify_stack_create.py
This action takes ~7 minutes. Once it's finished, copy the Role ARN and the Cluster endpoint.
Now, open the file src/sparkify.cfg
and paste this info in the corresponding keys:
...
ENDPOINT_ADDRESS = the-cluster-endpoint
ROLE_ARN = the-role-arn
...
We must model our data before we can execute our ETL pipeline. Type this command in your Terminal:
python create_tables.py
This step is almost immediate.
This step will test your patience. It takes ~1 hour, and performs the actions described below:
- Copy the source data, stored in S3, into the staging tables. I decided to split the copy of the song data into batches just to print out some info and give the user some feedback on the process.
- Populate the dimension tables with records about artists, songs, users and timestamps collected from the staging tables, using them as the source.
- Populate the fact table based on the dimensions previously created (a sketch of the SQL involved follows this list).
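To give a feel for what happens under the hood, here is a hedged sketch of the kind of statements `etl.py` runs. The exact SQL lives in `src/sql_queries.py`; the S3 prefix, JSONPaths file and time format shown here follow the usual Udacity project layout and are assumptions.

```sql
-- Sketch only: load the log data into the staging table...
COPY staging_events
FROM 's3://udacity-dend/log_data'            -- assumed source prefix
IAM_ROLE 'the-role-arn'                      -- the Role ARN from sparkify.cfg
REGION 'us-west-2'
FORMAT AS JSON 's3://udacity-dend/log_json_path.json'  -- assumed JSONPaths file
TIMEFORMAT AS 'epochmillisecs';

-- ...then fill each dimension with an INSERT ... SELECT, for example:
INSERT INTO users (user_id, first_name, last_name, gender, level)
SELECT DISTINCT userId, firstName, lastName, gender, level
FROM staging_events
WHERE page = 'NextSong'
  AND userId IS NOT NULL;
```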
Run this command:
python etl.py
Now you'd better go eat something, take a nap, or watch the latest episode of Stranger Things.
Just for fun, try to verify the number of records imported into each table.
Table | Type | Rows |
---|---|---|
staging_events | Staging | 8,056 |
staging_songs | Staging | 385,252 |
users | Dimension | 105 |
songs | Dimension | 384,995 |
artists | Dimension | 45,266 |
time | Dimension | 8,023 |
songplays | Fact | 7,268 |
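To check these numbers yourself, a query along these lines should do; it is a plain sketch that simply counts the rows of each table created above.

```sql
-- Count the rows loaded into each table.
SELECT 'staging_events' AS table_name, COUNT(*) AS row_count FROM staging_events
UNION ALL SELECT 'staging_songs', COUNT(*) FROM staging_songs
UNION ALL SELECT 'users', COUNT(*) FROM users
UNION ALL SELECT 'songs', COUNT(*) FROM songs
UNION ALL SELECT 'artists', COUNT(*) FROM artists
UNION ALL SELECT 'time', COUNT(*) FROM time
UNION ALL SELECT 'songplays', COUNT(*) FROM songplays;
```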
We can extract tons of valuable insights from the data stored in our data warehouse. A basic example is finding out which artists are the most listened to.
Use any AWS Redshift client and run this query:
SELECT
artists.name,
temp.listeners
FROM (
SELECT
COUNT(songplays.artist_id) AS listeners,
songplays.artist_id
FROM public."songplays" as songplays
GROUP BY songplays.artist_id
) AS temp
JOIN public."artists" AS artists
ON temp.artist_id = artists.artist_id
ORDER BY temp.listeners DESC
LIMIT 10;
You will get something like this:
# | Artist | Listeners |
---|---|---|
1 | Kings Of Leon | 77 |
2 | Coldplay | 71 |
3 | The Killers / Lou Reed | 62 |
4 | Eminem / Royce Da 5-9 | 59 |
5 | Kanye West / Paul Wall / GLC | 56 |
6 | Three Drives | 55 |
7 | Waylon Jennings | 49 |
8 | John Mayer featuring DJ Logic | 48 |
9 | Justin Bieber / Jessica Jarrell | 44 |
10 | Evanescence | 41 |
To remove the Sparkify stack, just run the following command in your Terminal:
python sparkify_stack_delete.py
In a few minutes, all the provisioned resources will be removed and will no longer incur costs in your AWS account.