ETL for a startup called Sparkify. Sparkify wants to analyze the data from songs and user activity on its streaming app. To understand which songs users are listening to, we combine user activity logs with song metadata.
The ETL runs in a conda environment with Python 3.10.
- Create a Redshift cluster
- Set up a database and take note of the database name, user, password, and port
- Make sure the cluster's VPC is accessible from the computer running this
- Record the cluster host and DB params in `dwh.cfg` under the `[CLUSTER]` section
- Create a role that has read access to S3 (`aws:policy/AmazonS3ReadOnlyAccess`) and record its ARN in `dwh.cfg` under the `[IAM_ROLE]` section
- Make sure the conda environment is active
- Create the tables: `python create_tables.py`
- Run the ETL: `python etl.py`
Contains command-line snippets, most of them for managing the conda environment
Contains snippets to create, pause, resume, and delete the Redshift cluster
Python code to execute DDL statements that initialize the sparkify database
Config file containing the:
- KEY
- SECRET
- PREFERRED_REGION
for the AWS user (git-ignored 😉)
Config file where we have the:
- Redshift connection string
- IAM ARN Role to read from S3 buckets
- S3 bucket URIs
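A sketch of what this `dwh.cfg` could look like; every value, and even the exact key names, below are illustrative placeholders rather than the project's actual contents (5439 is Redshift's default port):

```ini
[CLUSTER]
HOST=<redshift-cluster-endpoint>
DB_NAME=<database-name>
DB_USER=<database-user>
DB_PASSWORD=<database-password>
DB_PORT=5439

[IAM_ROLE]
ARN=<iam-role-arn>

[S3]
LOG_DATA=<s3-log-data-uri>
SONG_DATA=<s3-song-data-uri>
```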
Environment config. Contains the project dependencies for creating a conda environment
Notebook for testing each step of the process
Notebook for analyzing and cleaning artists
Notebook for analyzing and cleaning songs
Notebook for analyzing and cleaning users
ETL processing metadata and events into the songplays data mart. Since we are working with a powerful database such as Redshift, we transform the staging tables into the facts and dimensions using INSERT - SELECT statements. This way we take advantage of the database's cloud capabilities.
Example:
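A sketch of one such statement (the table and column names are illustrative assumptions, not the project's actual schema):

```python
# Hypothetical sketch of loading a dimension from a staging table with a
# single INSERT - SELECT pushed down to Redshift, as etl.py does.
# Table and column names are assumptions, not the project's schema.
user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT userid, firstname, lastname, gender, level
    FROM staging_events
    WHERE userid IS NOT NULL;
"""

# etl.py would then execute it over an open connection, e.g.:
# cur.execute(user_table_insert)
# conn.commit()
```

Because the whole transformation runs inside Redshift, no data is pulled back to the client between staging and the warehouse tables.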
DDL and DML SQL statements for the ETL
Notebook querying the data inserted by the ETL
Staging tables have string fields only, so the raw data is captured as-is from the source. Any data formatting, data conversion, deduplication, or filtering is done when loading the data from staging into the data warehouse tables.
- For staging events, `staging_events.artist` contains character content longer than the default varchar length (256)
- For staging songs, `staging_songs.title`, `staging_songs.artist_name`, and `staging_songs.artist_location` contain character content longer than the default varchar length (256)
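One way to handle this is to widen those columns past the 256 default; a sketch below, where the chosen length (65535, Redshift's VARCHAR maximum) and the abbreviated column list are assumptions:

```python
# Sketch of a staging DDL with the overflowing columns widened past the
# default VARCHAR(256). 65535 is Redshift's maximum VARCHAR length; the
# column list is abbreviated and illustrative, not the project's schema.
staging_songs_create = """
CREATE TABLE IF NOT EXISTS staging_songs (
    song_id         VARCHAR(256),
    title           VARCHAR(65535),
    artist_id       VARCHAR(256),
    artist_name     VARCHAR(65535),
    artist_location VARCHAR(65535)
);
"""
```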
I went down the rabbit hole of deduplicating artist rows that have a many-to-many relationship between `artist_name` and `artist_id`
A row from the songs dataset representing an artist, with duplicate rows from the same artist's songs removed
- Latitude and longitude are cast into the right data types
- As there are multiple records for the same artist entity, we assign a score (int) for how well each record captures:
- Latitude
- Longitude
- Location
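A minimal sketch of such a scoring function in Python; the actual scoring rules used in the notebooks are an assumption here:

```python
def artist_score(latitude, longitude, location):
    """Hypothetical completeness score: one point per populated field."""
    return sum(
        value is not None and value != ""
        for value in (latitude, longitude, location)
    )
```

During deduplication, the best-scoring row for each artist would then be the one kept.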
Represents an artist ID / Name combination
- The `multiple_name_indicator` and `multiple_id_indicator` flags are calculated to easily process each set of artist rows in phases
Working table that will become the artist name dimension. Stores the transformation records in 6 steps
- The artists dimension is replicated on every node
- `name` is the natural primary key and sort key. This is because songplays or streams reference songs and artists by title/name
Represents a song title. The song ID has been disregarded as it adds no value to the analysis
- The song titles dimension is the largest dimension
- It is distributed by title so its rows land on the same node as the songplays fact table records they join to
- The sort key is `artist_name` and `title`, as that is the natural composite primary key
- The user dimension table is replicated on every node, as it has comparatively much less data
- Calendar dimension for easily querying/aggregating blocks of time
- The time dimension key is a generated int key that represents year, month, day, and hour
- The time dimension is replicated on every node
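A sketch of how such a key could be derived from a timestamp; the exact encoding used by the project is an assumption:

```python
from datetime import datetime

def time_key(ts: datetime) -> int:
    # Encode year, month, day, and hour into one int,
    # e.g. 2018-11-30 14:00 -> 2018113014.
    return ts.year * 1_000_000 + ts.month * 10_000 + ts.day * 100 + ts.hour
```

Encoding the components in this order keeps the keys naturally sortable by time.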
- The PK, `songplay_id`, is an auto-increment int column
- The songplays fact table uses key distribution, with the timestamp as its sort key
- The distribution key is the song title, which corresponds to the largest dimension. This way songplay records are allocated on the same node as their song records
- `start_time_key` is the int representation of year, month, day, and hour that links to the time dimension
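The design above could be sketched as the following DDL; the column names, types, and IDENTITY start/step are assumptions, not the project's actual schema:

```python
# Hedged sketch of a songplays fact DDL reflecting the design above:
# DISTKEY on the song title (co-located with the songs dimension) and
# SORTKEY on the timestamp. Columns and types are illustrative only.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id    BIGINT IDENTITY(0, 1) PRIMARY KEY,
    start_time     TIMESTAMP SORTKEY,
    start_time_key INT,
    user_id        INT,
    level          VARCHAR(10),
    title          VARCHAR(65535) DISTKEY,
    artist_name    VARCHAR(65535),
    session_id     INT,
    location       VARCHAR(256),
    user_agent     VARCHAR(65535)
);
"""
```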