udacity-data-engineer-dwh

A data warehouse implementation using AWS Redshift and S3

Primary language: Python

Sparkify ETL Pipeline (Udacity Data Engineer Data Warehouse project)

In this project, I create a data warehouse using AWS Redshift and S3 and implement a pipeline to ingest S3 files into the database.

Two datasets are used:

  1. A song dataset, which is a subset of real data from the Million Song Dataset.
  2. A log dataset, which consists of log files in JSON format generated by an event simulator, based on the songs in the dataset above.

Both datasets are stored in the S3 bucket s3://udacity-dend and are ingested into fact and dimension tables.
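
For reference, the raw files can be browsed directly in the bucket. Below is a minimal sketch using boto3, assuming AWS credentials are configured locally; the song_data/ and log_data/ prefixes are assumptions about the bucket layout rather than something defined by this project.

import boto3

# List a few of the raw JSON files in the public Udacity bucket.
# The prefixes below are assumptions about how the bucket is organized.
s3 = boto3.resource("s3")
bucket = s3.Bucket("udacity-dend")

for prefix in ("song_data/", "log_data/"):
    print(f"--- {prefix} ---")
    for obj in bucket.objects.filter(Prefix=prefix).limit(5):
        print(obj.key)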

The project consists of the following files:

  • redshift.py: manages creation and teardown of a Redshift cluster.
  • create_tables.py: creates the Redshift staging, fact, and dimension tables.
  • etl.py: runs the ETL pipeline, extracting fact and dimension data from the S3 files.

The final product of the pipeline consists of the following tables:

Fact Table

songplays - records in event data associated with song plays, i.e., records with page NextSong
    songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables

users - users in the app
    user_id, first_name, last_name, gender, level
songs - songs in music database
    song_id, title, artist_id, year, duration
artists - artists in music database
    artist_id, name, location, latitude, longitude
time - timestamps of records in songplays broken down into specific units
    start_time, hour, day, week, month, year, weekday
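
For illustration, a table like songplays could be declared with a CREATE TABLE statement kept as a Python string, a common pattern for this kind of project. The column types and the IDENTITY surrogate key below are assumptions, not the repository's actual DDL.

# A sketch of a possible songplays definition; the types and the IDENTITY
# key are assumptions rather than the repository's actual DDL.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0,1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""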

Installation

Install the prerequisites for psycopg2 as described at https://www.psycopg.org/docs/install.html

If you run into an issue with -lssl, try the following command (per Stack Overflow):

env LDFLAGS="-I/usr/local/opt/openssl/include -L/usr/local/opt/openssl/lib" pip install psycopg2

Create the virtual environment:

python3 -m venv venv

Activate the virtual environment:

source venv/bin/activate

Install python packages:

pip3 install -r requirements.txt

In all sections below, it is assumed that the venv environment is activated.

The Redshift Cluster

This section assumes an admin role has been created in AWS and that the AWS key and secret are stored in the aws.cfg file.
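
The snippet below shows one way such a file can be read with Python's configparser; the [AWS] section and the KEY/SECRET option names are assumptions about the file layout, not something this repository guarantees.

import configparser

# A minimal sketch of reading credentials from aws.cfg. The section and
# option names ([AWS], KEY, SECRET) are assumptions about the file layout.
config = configparser.ConfigParser()
config.read("aws.cfg")

aws_key = config.get("AWS", "KEY")
aws_secret = config.get("AWS", "SECRET")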

Creating the cluster

Before starting the Pipeline, create a Redshift cluster using the following command:

python redshift.py --cmd create

This will create a Redshift cluster and assign it an S3 read-only role. The script relies on credential information from aws.cfg as well as AWS cluster configuration information from dwh.cfg (see the files in this repository).
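
Under the hood, a create step like this typically boils down to a boto3 call along the lines of the sketch below. This is not the repository's implementation; the region, node sizing, and identifiers are placeholder assumptions standing in for values read from aws.cfg and dwh.cfg.

import boto3

# A rough sketch of creating a Redshift cluster; not the actual redshift.py
# implementation. Region, sizing, and names are placeholder assumptions.
redshift = boto3.client(
    "redshift",
    region_name="us-west-2",
    aws_access_key_id="<KEY from aws.cfg>",
    aws_secret_access_key="<SECRET from aws.cfg>",
)

redshift.create_cluster(
    ClusterIdentifier="dwhCluster",
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="dwh",
    MasterUsername="dwhuser",
    MasterUserPassword="<password>",
    IamRoles=["<ARN of the S3 read-only role>"],
)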

The command also creates a cluster.cfg file containing both the Redshift hostname (as dwh_endpoint) and the ARN of the created role (as dwh_role_arn), which are used by the pipeline scripts to access the S3 files.
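
Downstream scripts can then pick those two values up with configparser, roughly as below. Only the dwh_endpoint and dwh_role_arn option names come from the description above; the [CLUSTER] section name is an assumption.

import configparser

# Read the endpoint and role ARN written by redshift.py. The [CLUSTER]
# section name is an assumption; the option names match the text above.
cluster = configparser.ConfigParser()
cluster.read("cluster.cfg")

dwh_endpoint = cluster.get("CLUSTER", "dwh_endpoint")
dwh_role_arn = cluster.get("CLUSTER", "dwh_role_arn")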

Deleting the Redshift cluster

To delete the cluster, run the following command:

python redshift.py --cmd delete
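
A delete step typically reduces to a single boto3 call, sketched below; the cluster identifier and region are placeholders for values configured in aws.cfg and dwh.cfg.

import boto3

# A minimal sketch of tearing the cluster down; the identifier and region
# are placeholder assumptions for values taken from the config files.
redshift = boto3.client("redshift", region_name="us-west-2")
redshift.delete_cluster(
    ClusterIdentifier="dwhCluster",
    SkipFinalClusterSnapshot=True,
)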

Running The Pipeline

There are two steps:

  1. Create the Redshift tables, including staging as well as fact and dimension tables:

python create_tables.py

  2. Run the ETL pipeline:

python etl.py

At the end of the script, sample data is printed to allow a sanity check, and a couple of tests are run to verify the contents of the tables.
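
Loading the staging tables from S3 is typically done with Redshift COPY statements issued over a psycopg2 connection, along the lines of the sketch below. The database name, credentials, S3 prefix, and JSON options are assumptions; the endpoint and role ARN come from cluster.cfg as described above.

import configparser
import psycopg2

# A sketch of how etl.py might load a staging table with a Redshift COPY.
# The database name, credentials, S3 prefix, JSON options, and the [CLUSTER]
# section name are assumptions about this repository's configuration.
cluster = configparser.ConfigParser()
cluster.read("cluster.cfg")

conn = psycopg2.connect(
    host=cluster.get("CLUSTER", "dwh_endpoint"),
    dbname="dwh",
    user="dwhuser",
    password="<password>",
    port=5439,
)
cur = conn.cursor()

copy_staging_songs = """
    COPY staging_songs
    FROM 's3://udacity-dend/song_data'
    IAM_ROLE '{}'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
""".format(cluster.get("CLUSTER", "dwh_role_arn"))

cur.execute(copy_staging_songs)
conn.commit()
conn.close()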

Staging Tables

  • staging_songs
  • staging_events

Fact Table

  • songplays

Dimension Tables

  • songs
  • artists
  • users
  • time
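
The fact and dimension tables are then populated from the staging tables with INSERT ... SELECT statements. The sketch below shows the general shape for songplays; the staging column names, the epoch-to-timestamp conversion, and the join conditions are assumptions about the staging schemas, not the repository's exact SQL.

# A sketch of loading the fact table from the staging tables; staging column
# names and join conditions are assumptions, not the repository's exact SQL.
songplay_table_insert = """
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                           session_id, location, user_agent)
    SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
           e.userId,
           e.level,
           s.song_id,
           s.artist_id,
           e.sessionId,
           e.location,
           e.userAgent
      FROM staging_events e
      JOIN staging_songs s
        ON e.song = s.title
       AND e.artist = s.artist_name
     WHERE e.page = 'NextSong';
"""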

Analysis

After running the pipeline, we are able to do some analysis on the songplay table.

For example, we can find the top 5 power users with the following query:

WITH top_users AS (
    SELECT user_id, COUNT(*) AS cnt
    FROM songplay
    GROUP BY user_id
    ORDER BY cnt DESC
    LIMIT 5
)
SELECT users.first_name, 
       users.last_name, 
       top_users.cnt
  FROM top_users
 INNER JOIN users
       ON users.user_id = top_users.user_id
 ORDER BY cnt DESC

and we get:

first_name  last_name  cnt
Chloe       Cuevas     41
Tegan       Levine     31
Kate        Harrell    28
Lily        Koch       20
Aleena      Kirby      18

We can also look for the top 5 most popular locations where songs are played, using the following query:

SELECT location, 
       COUNT(*) AS cnt 
  FROM songplay
 GROUP BY location 
 ORDER BY cnt DESC 
 LIMIT 5

and we get:

location                             cnt
San Francisco-Oakland-Hayward, CA     41
Portland-South Portland, ME           31
Lansing-East Lansing, MI              28
Chicago-Naperville-Elgin, IL-IN-WI    20
Atlanta-Sandy Springs-Roswell, GA     18