The purpose of this project is to analyze data simulated by eventsim. The simulated data consists of songs and user activity from a music streaming app. The main goal of the project is to construct a Postgres database with tables designed to optimize queries for song play analysis. Using the song and log datasets, an ETL pipeline creates a star schema for this analysis.
We leverage Docker containers for the ETL pipeline and database hosting.
- Run Docker Compose to spin up the required environment:

  ```
  docker-compose up
  ```

  Note that the `data` folder in this project is mapped to `/data` in the Docker container. A small sample of data generated by eventsim is included in the project.
- Access the Jupyter environment via the link provided by Docker.
- To run the ETL and test the created database, use the `test.ipynb` notebook in the `test` folder. The notebook executes the relevant scripts to create `sparkifydb` and fill the tables in the database via the ETL pipeline.
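As a rough illustration of what the ETL step does, the sketch below filters log events and loads song plays into the fact table. All names and sample records here are made up for illustration, and an in-memory SQLite database stands in for the project's Postgres `sparkifydb` so the sketch is self-contained:

```python
import sqlite3

# Hypothetical, simplified sketch of the songplay ETL step.
# The real pipeline parses eventsim JSON logs and writes to Postgres;
# here an in-memory SQLite database stands in for sparkifydb.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE songplays (
        songplay_id INTEGER PRIMARY KEY,
        start_time TEXT, user_id INTEGER, level TEXT,
        song_id TEXT, artist_id TEXT, session_id INTEGER,
        location TEXT, user_agent TEXT
    )
""")

# Made-up log records; only 'NextSong' events represent actual song plays.
log_records = [
    {"page": "NextSong", "ts": "2018-11-01 21:01:46", "userId": 8,
     "level": "free", "sessionId": 139, "location": "Phoenix, AZ",
     "userAgent": "Mozilla/5.0"},
    {"page": "Home", "ts": "2018-11-01 21:01:50", "userId": 8},
]

for rec in log_records:
    if rec["page"] != "NextSong":
        continue  # skip non-songplay events
    cur.execute(
        "INSERT INTO songplays (start_time, user_id, level, song_id, "
        "artist_id, session_id, location, user_agent) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (rec["ts"], rec["userId"], rec["level"], None, None,
         rec["sessionId"], rec["location"], rec["userAgent"]),
    )
conn.commit()

# Only the single NextSong event was loaded.
count = cur.execute("SELECT COUNT(*) FROM songplays").fetchone()[0]
print(count)  # prints 1
```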
Below is the schema of sparkifydb. To optimize queries for song play analysis and reduce data duplication, a star schema is used.
Records in the log data associated with song plays, i.e. records with page `NextSong`:

songplays | KEY |
---|---|
songplay_id | PK |
start_time | FK |
user_id | FK |
level | |
song_id | FK |
artist_id | FK |
session_id | |
location | |
user_agent | |
Users in the app:

users | KEY |
---|---|
user_id | PK |
first_name | |
last_name | |
gender | |
level | |
Songs in the music database:

songs | KEY |
---|---|
song_id | PK |
title | |
artist_id | FK |
year | |
duration | |
Artists in the music database:

artists | KEY |
---|---|
artist_id | PK |
name | |
location | |
latitude | |
longitude | |
Timestamps of records in songplays broken down into specific units:

time | KEY |
---|---|
start_time | PK |
hour | |
day | |
week | |
month | |
year | |
weekday | |
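With this star schema, a song play analysis question becomes a join from the `songplays` fact table out to the relevant dimensions. The sketch below shows one such query; the table contents are made up for illustration, and an in-memory SQLite database stands in for Postgres:

```python
import sqlite3

# Build a tiny slice of the star schema with made-up rows.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE songplays (songplay_id INTEGER PRIMARY KEY,
        start_time TEXT, user_id INTEGER, level TEXT,
        song_id TEXT, artist_id TEXT, session_id INTEGER,
        location TEXT, user_agent TEXT);
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, first_name TEXT,
        last_name TEXT, gender TEXT, level TEXT);
    CREATE TABLE songs (song_id TEXT PRIMARY KEY, title TEXT,
        artist_id TEXT, year INTEGER, duration REAL);

    INSERT INTO users VALUES (8, 'Kaylee', 'Summers', 'F', 'free');
    INSERT INTO songs VALUES ('S1', 'Made-up Song', 'A1', 2018, 215.0);
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
        session_id, location, user_agent)
    VALUES ('2018-11-01 21:01:46', 8, 'free', 'S1', 'A1', 139,
            'Phoenix, AZ', 'Mozilla/5.0');
""")

# Example analysis: who played which songs.
# The fact table joins out to the users and songs dimensions.
rows = cur.execute("""
    SELECT u.first_name, u.last_name, s.title
    FROM songplays sp
    JOIN users u ON sp.user_id = u.user_id
    JOIN songs s ON sp.song_id = s.song_id
""").fetchall()
print(rows)  # prints [('Kaylee', 'Summers', 'Made-up Song')]
```

Because each dimension attribute is stored once and referenced by key from the fact table, such queries avoid the duplication a single flat log table would carry.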