The purpose of this project is to analyze data simulated by eventsim. The simulated data consists of songs and user activity from music streaming app. The main goal of the project is to create a redshift database with tables designed to optimize queries on song play analysis. Using the song and log datasets, an ETL pipeline is used to create star schema for this analysis.
Before proceeding with deployment please create a AWS redshift instance with read access to S3 storage.
The project should be deployed on AWS. Configuration variables are read from dwh.cfg
.
- Copy example configuration file and fill it in.
cp dwh.example.cfg dwh.cfg
- Create tables in DW.
python create_tables.py
- Run ETL pipeline.
python etl.py
Note that you can use debugging flag (
python -D
... ) to see SQL commands that are executed. In addition, you can use command line arguments in order to execute only specific commands ( seepython create_tables.py -h
andpython etl.py -h
)
Below is presented the schema of sparkifydb. In order to optimize queries for song play analysis and reduce data duplication, star scheme is used in this DB.
Note that schema is distributed in a way to optimize queries on
songplays
table. Dimension tables are distributed to all nodes in the cluster.
Records in log data associated with song plays i.e. records with page NextSong
songplays | KEY |
---|---|
songplay_id | PK |
start_time | FK |
user_id | FK |
level | |
song_id | FK |
artist_id | FK |
session_id | |
location | |
user_agent |
Users in the app
users | KEY |
---|---|
user_id | PK |
first_name | |
last_name | |
gender | |
level |
Songs in music database
songs | KEY |
---|---|
song_id | PK |
title | |
artist_id | FK |
year | |
duration |
Artists in music database
artists | KEY |
---|---|
artist_id | PK |
name | |
location | |
lattitude | |
longitude |
Timestamps of records in songplays broken down into specific units
time | KEY |
---|---|
start_time | PK |
hour | |
day | |
week | |
month | |
year | |
weekday |
Allow programmic creation of Redshift cluster by finish create_cluter.py
script.
This project is licensed under the MIT License - see the LICENSE.md file for details