Sparkify is a startup which wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to.
In this document I will provide discussion on the process and decisions for the ETL pipeline.
The project includes three files:
- create_table.py: creates the fact and dimension tables for the star schema in Redshift.
- etl.py: loads data from S3 into staging tables on Redshift and then processes that data into analytics tables on Redshift.
- sql_queries.py: defines the SQL statements, which will be imported into the two other files above.
[CLUSTER]
HOST=
DB_NAME=
DB_USER=
DB_PASSWORD=
DB_PORT=
[IAM_ROLE]
ARN=
[S3]
LOG_DATA=s3://udacity-dend/log_data
LOG_JSONPATH=s3://udacity-dend/log_json_path.json
SONG_DATA=s3://udacity-dend/song_data
- Run create_tables.py to create the database and tables.
- Run etl.py to load the data into the project's tables.
column | type |
---|---|
artist | varchar |
auth | varchar |
first_name | varchar |
gender | char(1) |
item_session_id | int |
last_name | varchar |
length | numeric |
level | varchar |
location | varchar |
method | varchar |
page | varchar |
registration | numeric |
session_id | int |
song | varchar |
status | int |
ts | bigint |
user_agent | varchar |
user_id | int |
column | type |
---|---|
num_songs | int |
artist_id | varchar |
artist_latitude | numeric |
artist_longitude | numeric |
artist_location | varchar |
artist_name | varchar |
song_id | varchar |
title | varchar |
duration | float |
year | int |
Records in log data associated with song plays
column | type |
---|---|
songplay_id | serial (PK) |
start_time | TIMESTAMP (FK) |
user_id | int (FK) |
level | varchar |
song_id | varchar (FK) |
artist_id | varchar (FK) |
session_id | int |
location | varchar |
user_agent | varchar |
users in the app
column | type |
---|---|
user_id | int (PK) |
first_name | varchar |
last_name | varchar |
gender | char(1) |
level | varchar |
songs in music database
column | type |
---|---|
song_id | varchar (PK) |
title | varchar |
artist_id | varchar (FK) |
year | int |
duration | numeric |
artists in music database
column | type |
---|---|
artist_id | varchar (PK) |
name | varchar |
location | varchar |
latitude | numeric |
longitude | numeric |
timestamps of records in songplays broken down into specific units
column | type |
---|---|
start_time | TIMESTAMP (PK) |
hour | int |
day | int |
week | int |
month | int |
year | int |
weekday | int |