ETL for a startup called Sparkify. Sparkify wants to analyze the data from songs and user activity on its streaming app. To understand which songs users are listening to, we combine user activity logs with song metadata.
The ETL runs in a conda environment with Python 3.10.
- Create a Redshift cluster
- Set up a database and take note of the database name, user, password, and port
- Make sure the cluster's VPC is accessible from the computer running this
- Record the cluster host and DB params in `dwh.cfg` under the `[CLUSTER]` section
- Create a role that has read access to S3 (`aws:policy/AmazonS3ReadOnlyAccess`) and record its ARN in `dwh.cfg` under the `[IAM_ROLE]` section
- Make sure the conda environment is active
- Create the tables: `python create_tables.py`
- Run the ETL: `python etl.py`
Contains command-line snippets, most of them for managing the conda environment
Contains snippets to create, pause, resume, and delete the Redshift cluster
Python code to execute DDL statements that initialize the sparkify database
Config file containing the:
- KEY
- SECRET
- PREFERRED_REGION
for the AWS user (git-ignored 😉)
Config file where we have the:
- Redshift connection string
- IAM ARN Role to read from S3 buckets
- S3 bucket URIs
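A sketch of what this `dwh.cfg` could look like; every value, and even the exact key names, below are illustrative placeholders rather than the project's actual contents (5439 is Redshift's default port):

```ini
[CLUSTER]
HOST=<redshift-cluster-endpoint>
DB_NAME=<database-name>
DB_USER=<database-user>
DB_PASSWORD=<database-password>
DB_PORT=5439

[IAM_ROLE]
ARN=<iam-role-arn>

[S3]
LOG_DATA=<s3-log-data-uri>
SONG_DATA=<s3-song-data-uri>
```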
Environment config. Contains the project dependencies for creating a conda environment
Notebook for testing each step of the process
Notebook for analyzing and cleaning artists
Notebook for analyzing and cleaning songs
Notebook for analyzing and cleaning users
ETL processing metadata and events into the songplays data mart. Since we are working with a powerful database such as Redshift, we transform the staging tables into the facts and dimensions using INSERT - SELECT statements. This way we take advantage of the database's cloud capabilities.
Example:
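A sketch of one such statement (the table and column names are illustrative assumptions, not the project's actual schema):

```python
# Hypothetical sketch of loading a dimension from a staging table with a
# single INSERT - SELECT pushed down to Redshift, as etl.py does.
# Table and column names are assumptions, not the project's schema.
user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT userid, firstname, lastname, gender, level
    FROM staging_events
    WHERE userid IS NOT NULL;
"""

# etl.py would then execute it over an open connection, e.g.:
# cur.execute(user_table_insert)
# conn.commit()
```

Because the whole transformation runs inside Redshift, no data is pulled back to the client between staging and the warehouse tables.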
DDL and DML SQL statements for the ETL
Notebook querying the data inserted by the ETL
Staging tables have string fields only, so the raw data is captured as-is from the source. Any data formatting, data conversion, deduplication, or filtering is done when loading the data from staging into the data warehouse tables.
- For staging events, `staging_events.artist` contains character content longer than the default varchar length (256)
- For staging songs, `staging_songs.title`, `staging_songs.artist_name`, and `staging_songs.artist_location` contain character content longer than the default varchar length (256)
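One way to handle this is to widen those columns past the 256 default; a sketch below, where the chosen length (65535, Redshift's VARCHAR maximum) and the abbreviated column list are assumptions:

```python
# Sketch of a staging DDL with the overflowing columns widened past the
# default VARCHAR(256). 65535 is Redshift's maximum VARCHAR length; the
# column list is abbreviated and illustrative, not the project's schema.
staging_songs_create = """
CREATE TABLE IF NOT EXISTS staging_songs (
    song_id         VARCHAR(256),
    title           VARCHAR(65535),
    artist_id       VARCHAR(256),
    artist_name     VARCHAR(65535),
    artist_location VARCHAR(65535)
);
"""
```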
I went down the rabbit hole of deduplicating artist rows that have a many-to-many relationship between `artist_name` and `artist_id`
A row from the songs dataset representing an artist, with duplicate rows from the same artist's songs removed
- Latitude and longitude are cast into the right data types
- As there are multiple records for the same artist entity, we assign a score (int) for how well each record captures:
- Latitude
- Longitude
- Location
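A minimal sketch of such a scoring function in Python; the actual scoring rules used in the notebooks are an assumption here:

```python
def artist_score(latitude, longitude, location):
    """Hypothetical completeness score: one point per populated field."""
    return sum(
        value is not None and value != ""
        for value in (latitude, longitude, location)
    )
```

During deduplication, the best-scoring row for each artist would then be the one kept.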
Represents an artist ID / Name combination
- The `multiple_name_indicator` and `multiple_id_indicator` flags are calculated to easily process each set of artist rows in phases
Working table that will become the artist name dimension. Stores the transformation records in 6 steps
- The artists dimension is replicated on every node
- `name` is the natural primary key and sort key. This is because songplays or streams reference songs and artists by title/name
Represents a song title. The song ID has been disregarded as it adds no value to the analysis
- The song titles dimension is the largest dimension
- It is distributed by title so its rows land on the same node as the songplays fact table records they join to
- The sort key is `artist_name` and `title`, as that is the natural composite primary key
- The user dimension table is replicated on every node, as it has comparatively much less data
- Calendar dimension for easily querying/aggregating blocks of time
- The time dimension key is a generated int key that represents year, month, day, and hour
- The time dimension is replicated on every node
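A sketch of how such a key could be derived from a timestamp; the exact encoding used by the project is an assumption:

```python
from datetime import datetime

def time_key(ts: datetime) -> int:
    # Encode year, month, day, and hour into one int,
    # e.g. 2018-11-30 14:00 -> 2018113014.
    return ts.year * 1_000_000 + ts.month * 10_000 + ts.day * 100 + ts.hour
```

Encoding the components in this order keeps the keys naturally sortable by time.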
- The PK, `songplay_id`, is an auto-increment int column
- The songplays fact table uses key distribution, with the timestamp as its sort key
- The distribution key is the song title, which corresponds to the largest dimension. This way songplay records are allocated on the same node as their song records
- `start_time_key` is the int representation of year, month, day, and hour that links to the time dimension
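The design above could be sketched as the following DDL; the column names, types, and IDENTITY start/step are assumptions, not the project's actual schema:

```python
# Hedged sketch of a songplays fact DDL reflecting the design above:
# DISTKEY on the song title (co-located with the songs dimension) and
# SORTKEY on the timestamp. Columns and types are illustrative only.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id    BIGINT IDENTITY(0, 1) PRIMARY KEY,
    start_time     TIMESTAMP SORTKEY,
    start_time_key INT,
    user_id        INT,
    level          VARCHAR(10),
    title          VARCHAR(65535) DISTKEY,
    artist_name    VARCHAR(65535),
    session_id     INT,
    location       VARCHAR(256),
    user_agent     VARCHAR(65535)
);
"""
```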