This repo provides an example of using Redshift to load and analyze user listen and song data from a hypothetical music app. The code builds on sample code from a Udacity class project.
- Fill in
dwh.cfg
with your AWS information, cluster information (including host and DB login), and an IAM_ROLE that allows you to access your cluster. - Run
python create_tables.py
to create all tables - Run
python etl.py
to stage and load data, and run sample analysis.
NOTE: This code currently only loads 1 month of songplay data and a small subset of song data, for efficiency. This can be modified by changing the s3 addresses in dwh.cfg
to reference the entire dataset. Specifically, to do this, remove the /2018/11
and /A/A
suffixes where they appear in the LOG_DATA
and SONG_DATA
paths. Note that you will need a large cluster and patience to run the full dataset!
Data is staged in two tables: staging_songs
, which holds data from the million songs database, and staging_events
, which holds data describing app transactions. We'll be pulling data from both tables to produce the analysis tables (described below). We'll align data between the tables using the song title, so we use it as the distribution key for both tables. This will make copying to the final songplay
table more efficient.
The analysis table design is provided. It is a star schema design, with the songplay
table as the fact table and users
, artists
, songs
, and times
as the dimension tables. Since we do not know ahead of time how many users and artists there will be relative to the number of songplays, we will leave the distribution strategy as 'auto' -- this will automatically distribute smaller tables (eg, if there are only a few users, the users
table will be copied to all nodes) and distribute larger tables (eg, if there are many artists, the table will automatically be distributed).
We provide 2 example queries, one that joins the artists
and songplay
tables to gather the 5 most listened to artists, and one that uses only the songplay
table to gather the 5 most common user locations.