The purpose of this project is to analyze data simulated by eventsim. The simulated data consists of songs and user activity from a music streaming app. The main goal of the project is to construct a Postgres database with tables designed to optimize queries for song play analysis. Using the song and log datasets, an ETL pipeline creates a star schema for this analysis.
We leverage Docker containers for the ETL pipeline and database hosting.
- Run Docker Compose to spin up the required environment:

  ```
  docker-compose up
  ```

  Note that the `data` folder in this project is mapped to `/data` in the Docker container. A small sample of data generated by eventsim is included in the project.
- Access the Jupyter environment via the link provided by Docker.
- To run the ETL and test the created database, use the `test.ipynb` notebook in the `test` folder. The notebook executes the relevant scripts to create `sparkifydb` and fill the tables in the database via the ETL pipeline.
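As a rough illustration of what the ETL step does, the sketch below filters log events and loads song plays into the fact table. All names and sample records here are made up for illustration, and an in-memory SQLite database stands in for the project's Postgres `sparkifydb` so the sketch is self-contained:

```python
import sqlite3

# Hypothetical, simplified sketch of the songplay ETL step.
# The real pipeline parses eventsim JSON logs and writes to Postgres;
# here an in-memory SQLite database stands in for sparkifydb.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE songplays (
        songplay_id INTEGER PRIMARY KEY,
        start_time TEXT, user_id INTEGER, level TEXT,
        song_id TEXT, artist_id TEXT, session_id INTEGER,
        location TEXT, user_agent TEXT
    )
""")

# Made-up log records; only 'NextSong' events represent actual song plays.
log_records = [
    {"page": "NextSong", "ts": "2018-11-01 21:01:46", "userId": 8,
     "level": "free", "sessionId": 139, "location": "Phoenix, AZ",
     "userAgent": "Mozilla/5.0"},
    {"page": "Home", "ts": "2018-11-01 21:01:50", "userId": 8},
]

for rec in log_records:
    if rec["page"] != "NextSong":
        continue  # skip non-songplay events
    cur.execute(
        "INSERT INTO songplays (start_time, user_id, level, song_id, "
        "artist_id, session_id, location, user_agent) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (rec["ts"], rec["userId"], rec["level"], None, None,
         rec["sessionId"], rec["location"], rec["userAgent"]),
    )
conn.commit()

# Only the single NextSong event was loaded.
count = cur.execute("SELECT COUNT(*) FROM songplays").fetchone()[0]
print(count)  # prints 1
```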
Below is the schema of sparkifydb. To optimize queries for song play analysis and reduce data duplication, a star schema is used.
Records in the log data associated with song plays, i.e. records with page `NextSong`:

songplays | KEY |
---|---|
songplay_id | PK |
start_time | FK |
user_id | FK |
level | |
song_id | FK |
artist_id | FK |
session_id | |
location | |
user_agent | |
Users in the app:

users | KEY |
---|---|
user_id | PK |
first_name | |
last_name | |
gender | |
level | |
Songs in the music database:

songs | KEY |
---|---|
song_id | PK |
title | |
artist_id | FK |
year | |
duration | |
Artists in the music database:

artists | KEY |
---|---|
artist_id | PK |
name | |
location | |
latitude | |
longitude | |
Timestamps of records in songplays broken down into specific units:

time | KEY |
---|---|
start_time | PK |
hour | |
day | |
week | |
month | |
year | |
weekday | |
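With this star schema, a song play analysis question becomes a join from the `songplays` fact table out to the relevant dimensions. The sketch below shows one such query; the table contents are made up for illustration, and an in-memory SQLite database stands in for Postgres:

```python
import sqlite3

# Build a tiny slice of the star schema with made-up rows.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE songplays (songplay_id INTEGER PRIMARY KEY,
        start_time TEXT, user_id INTEGER, level TEXT,
        song_id TEXT, artist_id TEXT, session_id INTEGER,
        location TEXT, user_agent TEXT);
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, first_name TEXT,
        last_name TEXT, gender TEXT, level TEXT);
    CREATE TABLE songs (song_id TEXT PRIMARY KEY, title TEXT,
        artist_id TEXT, year INTEGER, duration REAL);

    INSERT INTO users VALUES (8, 'Kaylee', 'Summers', 'F', 'free');
    INSERT INTO songs VALUES ('S1', 'Made-up Song', 'A1', 2018, 215.0);
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
        session_id, location, user_agent)
    VALUES ('2018-11-01 21:01:46', 8, 'free', 'S1', 'A1', 139,
            'Phoenix, AZ', 'Mozilla/5.0');
""")

# Example analysis: who played which songs.
# The fact table joins out to the users and songs dimensions.
rows = cur.execute("""
    SELECT u.first_name, u.last_name, s.title
    FROM songplays sp
    JOIN users u ON sp.user_id = u.user_id
    JOIN songs s ON sp.song_id = s.song_id
""").fetchall()
print(rows)  # prints [('Kaylee', 'Summers', 'Made-up Song')]
```

Because each dimension attribute is stored once and referenced by key from the fact table, such queries avoid the duplication a single flat log table would carry.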