Postgres project to model simulated song streaming data.


Data Modeling with Postgres

Purpose

The purpose of this project is to analyze data simulated by eventsim. The simulated data consists of songs and user activity from a music streaming app. The main goal of the project is to construct a Postgres database with tables designed to optimize queries for song play analysis. Using the song and log datasets, an ETL pipeline builds a star schema for this analysis.

Usage

We leverage Docker containers for the ETL pipeline and database hosting.

  1. Run Docker Compose to spin up the required environment.
docker-compose up

Note that the data folder in this project is mapped to /data in the Docker container. A small sample of data generated by eventsim is included in the project.
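The volume mapping described above can be expressed in docker-compose.yml roughly as follows; the service name and image here are illustrative, so check the project's actual compose file for the real values.

```yaml
services:
  jupyter:                            # illustrative service name
    image: jupyter/minimal-notebook   # illustrative image
    volumes:
      - ./data:/data                  # project data folder mounted at /data in the container
```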

  2. Access the Jupyter environment via the link provided by Docker.

  3. To run the ETL and test the created database, use the test.ipynb notebook in the test folder. The notebook executes the relevant scripts to create sparkifydb and populate its tables via the ETL pipeline.
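The core transformation in the ETL step can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper name and the sample record are hypothetical, and in the real pipeline song_id and artist_id are looked up in the dimension tables before the row is inserted into songplays.

```python
import datetime

def log_record_to_songplay(record):
    """Transform one eventsim log record (a dict parsed from the log JSON)
    into a row for the songplays fact table. Hypothetical sketch."""
    if record.get("page") != "NextSong":
        return None  # only NextSong events count as song plays
    # eventsim timestamps are milliseconds since the epoch
    start_time = datetime.datetime.fromtimestamp(record["ts"] / 1000.0)
    return (
        start_time,
        record["userId"],
        record["level"],
        None,  # song_id (from a lookup in the songs dimension table)
        None,  # artist_id (from a lookup in the artists dimension table)
        record["sessionId"],
        record["location"],
        record["userAgent"],
    )

# Hypothetical log record; field names follow eventsim's log format
sample = {
    "page": "NextSong",
    "ts": 1541106106796,
    "userId": "8",
    "level": "free",
    "sessionId": 139,
    "location": "Phoenix-Mesa-Scottsdale, AZ",
    "userAgent": "Mozilla/5.0",
}
row = log_record_to_songplay(sample)
```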

Schema for Song Play Analysis

The schema of sparkifydb is presented below. A star schema is used to optimize queries for song play analysis and reduce data duplication.

Fact Table

Records in the log data associated with song plays, i.e. records with page set to NextSong.

| songplays | KEY |
| --- | --- |
| songplay_id | PK |
| start_time | FK |
| user_id | FK |
| level | |
| song_id | FK |
| artist_id | FK |
| session_id | |
| location | |
| user_agent | |

Dimension Tables

Users in the app

| users | KEY |
| --- | --- |
| user_id | PK |
| first_name | |
| last_name | |
| gender | |
| level | |

Songs in the music database

| songs | KEY |
| --- | --- |
| song_id | PK |
| title | |
| artist_id | FK |
| year | |
| duration | |

Artists in the music database

| artists | KEY |
| --- | --- |
| artist_id | PK |
| name | |
| location | |
| latitude | |
| longitude | |

Timestamps of records in songplays broken down into specific units

| time | KEY |
| --- | --- |
| start_time | PK |
| hour | |
| day | |
| week | |
| month | |
| year | |
| weekday | |
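Each row in the time table is derived from a millisecond timestamp in the log data. A minimal sketch of that breakdown, using Python's standard library (the helper name here is hypothetical):

```python
import datetime

def breakdown_timestamp(ts_ms):
    """Break a millisecond epoch timestamp (as found in the eventsim logs)
    into the units stored in the time dimension table."""
    dt = datetime.datetime.fromtimestamp(ts_ms / 1000.0, tz=datetime.timezone.utc)
    return {
        "start_time": dt,
        "hour": dt.hour,
        "day": dt.day,
        "week": dt.isocalendar()[1],  # ISO week number
        "month": dt.month,
        "year": dt.year,
        "weekday": dt.weekday(),      # 0 = Monday
    }

# 1541106106796 ms -> 2018-11-01 21:01:46 UTC, a Thursday
row = breakdown_timestamp(1541106106796)
```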