A music streaming startup has a large user base, with data stored in Amazon S3. This project builds an ETL pipeline that moves the data from S3 into a star schema on Amazon Redshift, optimized for queries on song plays.
- songplays - records in event data associated with song plays, i.e. records with page 'NextSong' (Fact Table)
- users - stores details of users of the app (Dimension Table)
- songs - stores details of songs on the app (Dimension Table)
- artists - stores details of artists whose music is on the app (Dimension Table)
- time - stores timestamps of records in songplays broken down into specific units (Dimension Table)
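Since the songplays fact table keeps only events whose `page` value is 'NextSong', the core filter can be sketched as below. The field names are illustrative, inferred from the table descriptions above rather than taken from the project's actual event schema:

```python
# Sample event records; only 'NextSong' pages represent song plays.
# Field names here are assumptions for illustration, not the real log schema.
events = [
    {"page": "NextSong", "userId": 1, "song": "Song A"},
    {"page": "Home", "userId": 1, "song": None},
    {"page": "NextSong", "userId": 2, "song": "Song B"},
]

# Keep only the records that belong in the songplays fact table.
songplays = [e for e in events if e["page"] == "NextSong"]
print(len(songplays))  # 2 of the 3 events are song plays
```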
- Create a new IAM role and grant it read-only access to Amazon S3 buckets
- Create a Redshift cluster and make sure its security group allows public access to the cluster and inbound TCP traffic (convenient for development; restrict this in production)
- Run the 'create_tables.py' file to create all required tables
- Next, run 'etl.py' to execute data loading and transformation
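The IAM role in the first step needs a trust policy that lets Redshift assume it, with the AWS-managed AmazonS3ReadOnlyAccess policy attached. A minimal sketch of the trust document is below; the service principal and managed-policy ARN are standard AWS values, but how you create the role (console, CLI, or boto3) is up to you:

```python
import json

# Trust policy allowing the Amazon Redshift service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# AWS-managed policy granting read-only access to S3; attach this to the role.
S3_READONLY_ARN = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"

print(json.dumps(trust_policy, indent=2))
```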
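The loading step in etl.py typically issues Redshift `COPY` commands to pull the raw JSON from S3 into staging tables, using the IAM role created earlier, before `INSERT ... SELECT` statements populate the star schema. A hedged sketch of how such a COPY statement might be built; the bucket path, role ARN, and region below are placeholders, not values from this project:

```python
def build_copy(table, s3_path, iam_role_arn, json_option="auto"):
    """Build a Redshift COPY command loading JSON data from S3 via an IAM role."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"FORMAT AS JSON '{json_option}' REGION 'us-west-2';"
    )

# Placeholder bucket and role ARN, for illustration only.
sql = build_copy(
    "staging_events",
    "s3://example-bucket/log_data",
    "arn:aws:iam::123456789012:role/redshift-s3-readonly",
)
print(sql)
```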