A (fictional) music streaming start-up, Sparkify, wants a data warehouse to analyze what users are listening to.
Their music library and log data about user events are stored as JSON files on AWS S3. The goal of this project is to make the data easy to analyze. To accomplish this, we load the data into a Redshift cluster with a star schema database.
The center of the star schema ("fact table") is the `songplays` table, which contains the following columns:

- `songplay_id`: primary key
- `start_time`: when the song was played; also a foreign key to the `time` table
- `user_id`: foreign key to the `users` table
- `level`: free or paid tier
- `song_id`: foreign key to the `songs` table
- `artist_id`: foreign key to the `artists` table
- `session_id`: id for a continuous user session
- `location`: where the song was played
- `user_agent`: how the song was played (e.g. which browser was used)
The points of the star schema ("dimension tables") are:

- `songs`: `song_id`, (song) `title`, `artist_id`, `year` (released), (song) `duration`
- `artists`: `artist_id`, (artist) `name`, (artist) `location`, `latitude`, and `longitude`
- `users`: `user_id`, `first_name`, `last_name`, `gender`, `level` (free or paid tier)
- `time`: `start_time`, `hour`, `day`, `week`, `month`, `year`, `weekday` (true/false)
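To make the schema concrete, here is a minimal sketch of what the `songplays` DDL could look like on Redshift. The distribution and sort keys are illustrative assumptions, and the actual queries shipped in `src` may differ:

```python
# Hypothetical sketch of the songplays DDL. The DISTKEY/SORTKEY choices
# below are assumptions, not necessarily the project's actual settings.
CREATE_SONGPLAYS = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL REFERENCES time (start_time) SORTKEY,
    user_id     INT NOT NULL REFERENCES users (user_id),
    level       VARCHAR,
    song_id     VARCHAR REFERENCES songs (song_id) DISTKEY,
    artist_id   VARCHAR REFERENCES artists (artist_id),
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""
```

Sorting on `start_time` and distributing on `song_id` is one reasonable choice for a fact table that is mostly filtered by time and joined to `songs`.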
- The `src` module contains code for creating and managing a Redshift cluster, along with the SQL queries used in the scripts.
- The `bin` directory contains scripts for creating AWS infrastructure, creating SQL tables, loading the tables, and querying the database.
- Installing the repo locally allows the user to run everything from the command line. See below.
- Run `pip install -e git+https://github.com/brendan-m-murphy/udacity-dend-project-3.git#egg=project3` to install a local copy of the project.
- Create an AWS user with admin privileges, download the credentials as a .csv file, and move it to `src/project3`.
- Run `config` to create `dwh.cfg`.
- Run `iac` to create a Redshift role and cluster (sketched below).
- Run `create-tables` to create tables for staging and the star schema.
- Run `etl -y` to load all of the JSON data, or `etl -t` to load a test set (see the COPY sketch below).
- Run `analytics` to try test queries.
- Run `cleanup` to delete all AWS resources. (Or, use `pause` and `resume` to pause and resume the cluster.)
Note: these scripts should be run in the directory where the repo is installed.
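For reference, the `iac` step amounts to creating an IAM role that can read from S3 and launching a Redshift cluster with boto3. A minimal sketch, assuming hypothetical names and settings (the actual script reads these values from `dwh.cfg`):

```python
import boto3

# Launch a Redshift cluster. The identifier, sizes, and credentials here
# are placeholder assumptions; the real script reads them from dwh.cfg.
redshift = boto3.client("redshift", region_name="us-west-2")
redshift.create_cluster(
    ClusterIdentifier="sparkify-cluster",
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="sparkify",
    MasterUsername="admin",
    MasterUserPassword="Passw0rd",  # placeholder; use the value from dwh.cfg
    IamRoles=["arn:aws:iam::123456789012:role/sparkify-redshift-role"],
)
```

Likewise, the `etl` step stages the JSON files from S3 with Redshift's COPY command before inserting into the star schema tables. A sketch of one staging load, where the bucket paths and role ARN are assumptions:

```python
# Hypothetical COPY for the event logs. The JSONPaths file maps JSON
# fields to staging columns; real paths and credentials come from dwh.cfg.
COPY_STAGING_EVENTS = """
COPY staging_events
FROM 's3://udacity-dend/log_data'
IAM_ROLE 'arn:aws:iam::123456789012:role/sparkify-redshift-role'
REGION 'us-west-2'
FORMAT AS JSON 's3://udacity-dend/log_json_path.json';
"""
```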