A music streaming startup has a large user base, with data stored in Amazon S3. This project builds an ETL pipeline that moves the data from S3 into a star schema on Amazon Redshift, optimized for queries on song plays.
- songplays - records in event data associated with song plays, i.e. records with page 'NextSong' (Fact Table)
- users - stores details of users of the app (Dimension Table)
- songs - stores details of songs on the app (Dimension Table)
- artists - stores details of artists whose music is on the app (Dimension Table)
- time - stores timestamps of records in songplays broken down into specific units (Dimension Table)
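Since the songplays fact table keeps only events whose `page` value is 'NextSong', the core filter can be sketched as below. The field names are illustrative, inferred from the table descriptions above rather than taken from the project's actual event schema:

```python
# Sample event records; only 'NextSong' pages represent song plays.
# Field names here are assumptions for illustration, not the real log schema.
events = [
    {"page": "NextSong", "userId": 1, "song": "Song A"},
    {"page": "Home", "userId": 1, "song": None},
    {"page": "NextSong", "userId": 2, "song": "Song B"},
]

# Keep only the records that belong in the songplays fact table.
songplays = [e for e in events if e["page"] == "NextSong"]
print(len(songplays))  # 2 of the 3 events are song plays
```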
- Create a new IAM role and grant it read-only access to Amazon S3 buckets
- Create a Redshift cluster and make sure its security group allows public access to the cluster and inbound TCP traffic (convenient for development; restrict this in production)
- Run the 'create_tables.py' file to create all required tables
- Next, run 'etl.py' to execute data loading and transformation
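The IAM role in the first step needs a trust policy that lets Redshift assume it, with the AWS-managed AmazonS3ReadOnlyAccess policy attached. A minimal sketch of the trust document is below; the service principal and managed-policy ARN are standard AWS values, but how you create the role (console, CLI, or boto3) is up to you:

```python
import json

# Trust policy allowing the Amazon Redshift service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# AWS-managed policy granting read-only access to S3; attach this to the role.
S3_READONLY_ARN = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"

print(json.dumps(trust_policy, indent=2))
```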
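The loading step in etl.py typically issues Redshift `COPY` commands to pull the raw JSON from S3 into staging tables, using the IAM role created earlier, before `INSERT ... SELECT` statements populate the star schema. A hedged sketch of how such a COPY statement might be built; the bucket path, role ARN, and region below are placeholders, not values from this project:

```python
def build_copy(table, s3_path, iam_role_arn, json_option="auto"):
    """Build a Redshift COPY command loading JSON data from S3 via an IAM role."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"FORMAT AS JSON '{json_option}' REGION 'us-west-2';"
    )

# Placeholder bucket and role ARN, for illustration only.
sql = build_copy(
    "staging_events",
    "s3://example-bucket/log_data",
    "arn:aws:iam::123456789012:role/redshift-s3-readonly",
)
print(sql)
```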