UDataEng_L03_P02_S3toRedshiftDW

Udacity Data Engineering assignment: load a Redshift data warehouse from S3

Primary language: Jupyter Notebook

Overview / Purpose

ETL for a startup called Sparkify. Sparkify wants to analyze the song and user-activity data it collects from its music streaming app. To understand which songs users are listening to, we combine the user activity logs with the song metadata.

How to Run

The ETL runs in a conda environment with Python 3.10.

  • Create a Redshift Cluster

  • Set up a database and take note of the database name, user, password and port

  • Make sure the cluster's VPC is accessible from the computer running this project

  • Record the cluster host and database parameters in the [CLUSTER] section of dwh.cfg (a sample layout is sketched in the file list below)

  • Create an IAM role that can read from S3 (the AmazonS3ReadOnlyAccess managed policy) and record its ARN in the [IAM_ROLE] section of dwh.cfg

  • Make sure the conda environment is active

  • Run create_tables.py

    python create_tables.py

  • Run etl.py

    python etl.py

  • Run Test Notebook

File list

Contains command-line snippets, most of them for managing the conda environment

Contains snippets to create, pause, resume and delete the Redshift cluster

Python code to execute DDL statements that initialize the sparkify database

Config file where we have the:

- KEY
- SECRET
- PREFERRED_REGION

for the AWS user (git-ignored 😉)

Config file where we have the:

- Redshift connection string
- ARN of the IAM role that reads from the S3 buckets
- S3 bucket URIs
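
A minimal sketch of what this file (presumably the dwh.cfg referenced in the run instructions) might look like. The [CLUSTER] and [IAM_ROLE] section names come from the notes above; the individual key names, the [S3] section and all placeholder values are assumptions, not the repo's actual config:

    [CLUSTER]
    HOST = <redshift-cluster-endpoint>
    DB_NAME = <database-name>
    DB_USER = <database-user>
    DB_PASSWORD = <database-password>
    DB_PORT = 5439

    [IAM_ROLE]
    ARN = <arn-of-the-s3-read-only-role>

    [S3]
    LOG_DATA = s3://<log-data-bucket>/
    SONG_DATA = s3://<song-data-bucket>/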

Environment config. Contains the project dependencies for creating the conda environment

Notebook for testing each step of the process

Notebook for analyzing and cleaning artists

Notebook for analyzing and cleaning songs

Notebook for analyzing and cleaning users

ETL processing the song metadata and events into the songplays data mart. Since we are working with a powerful database such as Redshift, we transform the staging tables into the facts and dimensions using INSERT ... SELECT statements. This way we take advantage of the database's cloud capabilities.

Example:

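A minimal sketch of such an INSERT ... SELECT (the table and column names are assumptions based on the schema notes further down, not code taken from the repo):

    -- Hedged sketch: load the songplays fact table straight from staging,
    -- letting Redshift do the casting and filtering in one statement.
    INSERT INTO songplays (start_time_key, user_id, level, song_title,
                           artist_name, session_id, location, user_agent)
    SELECT CAST(TO_CHAR(TIMESTAMP 'epoch' + CAST(e.ts AS BIGINT) / 1000 * INTERVAL '1 second',
                        'YYYYMMDDHH24') AS INT) AS start_time_key,
           CAST(e.user_id AS INT),
           e.level,
           e.song   AS song_title,
           e.artist AS artist_name,
           CAST(e.session_id AS INT),
           e.location,
           e.user_agent
    FROM staging_events e
    WHERE e.page = 'NextSong';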

DDL and DML SQL statements for the ETL

Notebook querying the data inserted by the ETL

Raw Staging

Staging tables have string fields only, so the raw data is captured exactly as it comes from the source. Any data formatting, type conversion, deduplication or filtering is done when loading the data from staging into the data warehouse tables.
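
The raw load from S3 into these staging tables is typically done with Redshift's COPY command. A hedged sketch, with the bucket URI, JSONPaths file and role ARN as placeholders (the real values would come from dwh.cfg):

    -- Hedged sketch: bulk-load raw JSON events from S3 into staging_events.
    COPY staging_events
    FROM 's3://<log-data-bucket>/'
    IAM_ROLE '<iam-role-arn-from-dwh.cfg>'
    FORMAT AS JSON 's3://<log-jsonpaths-file>'
    REGION '<preferred-region>';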

Raw Staging Table staging_events

  • For staging events, staging_events.artist contains character content longer than the default VARCHAR length (256), so it gets a wider column (sketched below)

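A minimal DDL sketch for staging_events under those constraints; the column list is an assumption based on the standard event-log fields, not the repo's actual DDL:

    -- Hedged sketch: all columns kept as strings, with the wide text columns
    -- sized above the 256-character default.
    CREATE TABLE IF NOT EXISTS staging_events (
        artist     VARCHAR(MAX),   -- longer than the 256-char default
        song       VARCHAR(MAX),
        page       VARCHAR,
        ts         VARCHAR,        -- epoch milliseconds, kept as text
        user_id    VARCHAR,
        level      VARCHAR,
        session_id VARCHAR,
        location   VARCHAR,
        user_agent VARCHAR
        -- ...remaining event-log fields, all as VARCHAR
    );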

Raw Staging Table staging_songs

  • For staging songs, staging_songs.title, staging_songs.artist_name and staging_songs.artist_location contain character content longer than the default VARCHAR length (256), so they get wider columns (sketched below)

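A matching DDL sketch for staging_songs; again the column list is an assumption based on the standard song-metadata fields:

    -- Hedged sketch: the three wide text columns sized above the 256-character
    -- default; everything else stays a plain string.
    CREATE TABLE IF NOT EXISTS staging_songs (
        song_id          VARCHAR,
        title            VARCHAR(MAX),
        artist_id        VARCHAR,
        artist_name      VARCHAR(MAX),
        artist_location  VARCHAR(MAX),
        artist_latitude  VARCHAR,
        artist_longitude VARCHAR,
        year             VARCHAR,
        duration         VARCHAR,
        num_songs        VARCHAR
    );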

Intermediate Staging

Artists

I went down the rabbit hole of deduplicating artist rows that have a many-to-many relationship between artist_name and artist_id.

Intermediate Staging Table staging_artist_row

A row from the songs dataset representing an artist, with duplicates coming from multiple songs by the same artist removed.

  • Latitude and longitude are cast into the right data types
  • As there are multiple records for the same artist entity, we assign an int score for how completely each row records the following (see the sketch below):
    • Latitude
    • Longitude
    • Location

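A hedged sketch of how this table could be populated; the target column names (including the score column) are assumptions, not the repo's actual code:

    -- Hedged sketch: one row per distinct artist tuple from staging_songs, with
    -- coordinates cast to numerics and a completeness score per row.
    INSERT INTO staging_artist_row (artist_id, artist_name, location, latitude,
                                    longitude, completeness_score)
    SELECT DISTINCT
           artist_id,
           artist_name,
           artist_location AS location,
           CAST(NULLIF(artist_latitude, '')  AS DECIMAL(9,5)) AS latitude,
           CAST(NULLIF(artist_longitude, '') AS DECIMAL(9,5)) AS longitude,
           (CASE WHEN NULLIF(artist_latitude, '')  IS NOT NULL THEN 1 ELSE 0 END
          + CASE WHEN NULLIF(artist_longitude, '') IS NOT NULL THEN 1 ELSE 0 END
          + CASE WHEN NULLIF(artist_location, '')  IS NOT NULL THEN 1 ELSE 0 END) AS completeness_score
    FROM staging_songs;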

Intermediate Staging Table staging_artist_id_name

Represents an artist ID / Name combination

  • multiple_name_indicator and multiple_id_indicator flags are calculated so that each set of artist rows can easily be processed in phases (see the sketch below)

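A hedged sketch of how these flags could be derived; the source/target column names are assumptions:

    -- Hedged sketch: flag artist ids mapped to several names and names mapped to
    -- several ids, so each group can be handled in its own phase.
    INSERT INTO staging_artist_id_name (artist_id, artist_name,
                                        multiple_name_indicator, multiple_id_indicator)
    SELECT artist_id,
           artist_name,
           CASE WHEN COUNT(*) OVER (PARTITION BY artist_id)   > 1 THEN 1 ELSE 0 END,
           CASE WHEN COUNT(*) OVER (PARTITION BY artist_name) > 1 THEN 1 ELSE 0 END
    FROM (SELECT DISTINCT artist_id, artist_name FROM staging_artist_row) pairs;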

Intermediate Staging Table staging_artist_names

Working table that will become the artist name dimension. It stores the transformation records across 6 steps.

Datawarehouse

Dimension Table artist_names

  • The artists dimension will be replicated on every node (sketched below)
  • name is the natural primary key and the sort key, because songplays/streams reference songs and artists by title and name

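A hedged DDL sketch for this dimension; the non-key columns and their sizes are assumptions:

    -- Hedged sketch: small dimension replicated on every node (DISTSTYLE ALL),
    -- sorted by its natural key.
    CREATE TABLE IF NOT EXISTS artist_names (
        name      VARCHAR(512) NOT NULL SORTKEY,
        location  VARCHAR(512),
        latitude  DECIMAL(9,5),
        longitude DECIMAL(9,5),
        PRIMARY KEY (name)
    )
    DISTSTYLE ALL;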

Dimension Table song_titles

Represents a song title. The song id has been disregarded as it adds no value to the analysis.

  • The song titles dimension is the largest dimension
  • It is distributed by title (KEY distribution) so its rows land on the same slices as the matching songplays fact records (sketched below)
  • The sort key is (artist_name, title), the natural composite primary key

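A hedged DDL sketch; the non-key columns and sizes are assumptions:

    -- Hedged sketch: largest dimension, distributed by title so it is collocated
    -- with the songplays fact rows, sorted by its natural composite key.
    CREATE TABLE IF NOT EXISTS song_titles (
        title       VARCHAR(512) NOT NULL DISTKEY,
        artist_name VARCHAR(512) NOT NULL,
        year        INT,
        duration    DECIMAL(9,5),
        PRIMARY KEY (artist_name, title)
    )
    COMPOUND SORTKEY (artist_name, title);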

Dimension Table users

  • The users dimension table will be replicated on every node, as it holds comparatively little data (sketched below)

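A hedged DDL sketch; the column names and sizes are assumptions based on the event-log user fields:

    -- Hedged sketch: small dimension, replicated on every node.
    CREATE TABLE IF NOT EXISTS users (
        user_id    INT NOT NULL SORTKEY,
        first_name VARCHAR(256),
        last_name  VARCHAR(256),
        gender     VARCHAR(16),
        level      VARCHAR(16),
        PRIMARY KEY (user_id)
    )
    DISTSTYLE ALL;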

Dimension Table time

  • Calendar dimension for easily querying and aggregating blocks of time
  • The time dimension key is a generated int key that encodes year, month, day and hour (sketched below)
  • The time dimension will be replicated on every node

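A hedged sketch of the dimension and of how the int key could be generated; column names and the exact attribute set are assumptions:

    -- Hedged sketch: calendar dimension keyed by an int that encodes
    -- year, month, day and hour (e.g. 2018-11-05 21:00 -> 2018110521).
    CREATE TABLE IF NOT EXISTS time (
        time_key   INT NOT NULL SORTKEY,
        start_time TIMESTAMP NOT NULL,
        hour INT, day INT, month INT, year INT, weekday INT,
        PRIMARY KEY (time_key)
    )
    DISTSTYLE ALL;

    -- Generating the hour-grain key from the raw epoch-millisecond ts string:
    SELECT DISTINCT
           CAST(TO_CHAR(d.hour_start, 'YYYYMMDDHH24') AS INT) AS time_key,
           d.hour_start AS start_time
    FROM (SELECT DATE_TRUNC('hour',
                 TIMESTAMP 'epoch' + CAST(ts AS BIGINT) / 1000 * INTERVAL '1 second') AS hour_start
          FROM staging_events
          WHERE page = 'NextSong') d;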

Fact Table songplays

  • The PK songplay_id is an auto-incrementing (IDENTITY) int column
  • The songplays fact table uses KEY distribution, with the timestamp as its sort key (sketched below)
  • The distribution key is the song title, which corresponds to the largest dimension; this way songplays and song records are allocated on the same slices
  • start_time_key is the int representation of year, month, day and hour that links to the time dimension

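A hedged DDL sketch of the fact table; the non-key columns and sizes are assumptions, and sorting on start_time_key is my reading of "sort key is the timestamp":

    -- Hedged sketch: fact table with an IDENTITY surrogate key, distributed by
    -- the song title (the largest dimension's key) and sorted by the time key.
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id    BIGINT IDENTITY(0, 1),
        start_time_key INT NOT NULL SORTKEY,   -- YYYYMMDDHH, links to the time dimension
        user_id        INT,
        level          VARCHAR(16),
        song_title     VARCHAR(512) DISTKEY,   -- collocated with song_titles rows
        artist_name    VARCHAR(512),
        session_id     INT,
        location       VARCHAR(512),
        user_agent     VARCHAR(1024),
        PRIMARY KEY (songplay_id)
    );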