A (fictional) music streaming start-up, Sparkify, wants a data warehouse to analyze what users are listening to.
Their music library and log data about user events are stored as JSON files on AWS S3. The goal of this project is to make the data easy to analyze. To accomplish this, we load the data into a Redshift cluster with a star schema database.
The center of the star schema ("fact table") is the `songplays` table, which contains the following columns:

- `songplay_id`: primary key
- `start_time`: when the song was played; also a foreign key to the `time` table
- `user_id`: foreign key to the `users` table
- `level`: free or paid tier
- `song_id`: foreign key to the `songs` table
- `artist_id`: foreign key to the `artists` table
- `session_id`: id for a continuous user session
- `location`: where the song was played
- `user_agent`: how the song was played (e.g. which browser was used)
The points of the star schema ("dimension tables") are:

- `songs`: `song_id`, (song) `title`, `artist_id`, `year` (released), (song) `duration`
- `artists`: `artist_id`, (artist) `name`, (artist) `location`, `latitude`, and `longitude`
- `users`: `user_id`, `first_name`, `last_name`, `gender`, `level` (free or paid tier)
- `time`: `start_time`, `hour`, `day`, `week`, `month`, `year`, `weekday` (true/false)
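To make the schema concrete, here is a minimal sketch of what the `songplays` DDL could look like on Redshift. The distribution and sort keys are illustrative assumptions, and the actual queries shipped in `src` may differ:

```python
# Hypothetical sketch of the songplays DDL. The DISTKEY/SORTKEY choices
# below are assumptions, not necessarily the project's actual settings.
CREATE_SONGPLAYS = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL REFERENCES time (start_time) SORTKEY,
    user_id     INT NOT NULL REFERENCES users (user_id),
    level       VARCHAR,
    song_id     VARCHAR REFERENCES songs (song_id) DISTKEY,
    artist_id   VARCHAR REFERENCES artists (artist_id),
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""
```

Sorting on `start_time` and distributing on `song_id` is one reasonable choice for a fact table that is mostly filtered by time and joined to `songs`.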
- The `src` module contains code for creating and managing a Redshift cluster, along with the SQL queries used in the scripts.
- The `bin` directory contains scripts for creating AWS infrastructure, creating SQL tables, loading the tables, and querying the database.
- Installing the repo locally allows the user to run everything from the command line. See below.
- Run `pip install -e git+https://github.com/brendan-m-murphy/udacity-dend-project-3.git#egg=project3` to install a local copy of the project.
- Create an AWS user with admin privileges, download the credentials as a .csv file, and move it to `src/project3`.
- Run `config` to create `dwh.cfg`.
- Run `iac` to create a Redshift role and cluster (sketched below).
- Run `create-tables` to create tables for staging and the star schema.
- Run `etl -y` to load all of the JSON data, or `etl -t` to load a test set (see the COPY sketch below).
- Run `analytics` to try test queries.
- Run `cleanup` to delete all AWS resources. (Or, use `pause` and `resume` to pause and resume the cluster.)
Note: these scripts should be run in the directory where the repo is installed.
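For reference, the `iac` step amounts to creating an IAM role that can read from S3 and launching a Redshift cluster with boto3. A minimal sketch, assuming hypothetical names and settings (the actual script reads these values from `dwh.cfg`):

```python
import boto3

# Launch a Redshift cluster. The identifier, sizes, and credentials here
# are placeholder assumptions; the real script reads them from dwh.cfg.
redshift = boto3.client("redshift", region_name="us-west-2")
redshift.create_cluster(
    ClusterIdentifier="sparkify-cluster",
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="sparkify",
    MasterUsername="admin",
    MasterUserPassword="Passw0rd",  # placeholder; use the value from dwh.cfg
    IamRoles=["arn:aws:iam::123456789012:role/sparkify-redshift-role"],
)
```

Likewise, the `etl` step stages the JSON files from S3 with Redshift's COPY command before inserting into the star schema tables. A sketch of one staging load, where the bucket paths and role ARN are assumptions:

```python
# Hypothetical COPY for the event logs. The JSONPaths file maps JSON
# fields to staging columns; real paths and credentials come from dwh.cfg.
COPY_STAGING_EVENTS = """
COPY staging_events
FROM 's3://udacity-dend/log_data'
IAM_ROLE 'arn:aws:iam::123456789012:role/sparkify-redshift-role'
REGION 'us-west-2'
FORMAT AS JSON 's3://udacity-dend/log_json_path.json';
"""
```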