
Data Modeling with PostgreSQL




About the Project

Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in their app.

They'd like a data engineer to create a Postgres database with tables designed to optimize queries on song play analysis, and they've brought you onto the project. Your role is to create a database schema and an ETL pipeline for this analysis. You'll be able to test your database and ETL pipeline by running queries provided by Sparkify's analytics team and comparing your results with their expected results.

Description

You will create a PostgreSQL database to host the data collected on songs and user activity from Sparkify's new music streaming app. First, you will design the database schema, defining the fact and dimension tables of a star schema. Then, you will build an ETL pipeline in Python and SQL that transfers data from the files in the local directories into the PostgreSQL tables. Finally, you will test the database and ETL pipeline using the tests provided by the analytics team.

Tools

  • Python
  • PostgreSQL
  • Jupyter Notebook

Datasets

You will combine data from two local directories to create the fact and dimension tables.

The first dataset contains metadata about each song and its artist, saved in JSON format. These files are partitioned by the first three letters of each song's track ID and are located under ./data/song_data/. A single song file contains the following fields:

Field            | Description
-----------------|------------------------------------------------
num_songs        | number of songs by the artist
artist_id        | artist id
artist_latitude  | latitude of the artist's location, if provided
artist_longitude | longitude of the artist's location, if provided
artist_location  | address or location of the artist
artist           | full name of the artist
song_id          | id used to identify the song
title            | title of the song
duration         | duration of the song
year             | year the song was released
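
As an illustration, a single song file can be loaded with pandas. This is a minimal sketch; the file path below is a hypothetical example and actual filenames will differ:

    import pandas as pd

    # Hypothetical path to one song file under ./data/song_data/
    song_file = "./data/song_data/A/A/A/TRAAAAW128F429D538.json"

    # Each song file holds a single JSON record
    df = pd.read_json(song_file, lines=True)
    print(df[["song_id", "title", "artist_id", "year", "duration"]].values[0])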

The second dataset contains log files generated by an event simulator based on the songs in the first dataset; the simulator produces activity logs that mimic user behavior on the Sparkify music streaming app. These log files are partitioned by year and month and are located under ./data/log_data/. Each file contains the following fields:

Field         | Description
--------------|------------------------------------------------
artist        | name of the artist
auth          | authentication status (e.g., 'Logged In', 'Logged Out')
firstName     | user first name
lastName      | user last name
gender        | user gender
length        | length of the song played, in seconds
itemInSession | sequence number of the item within the session
level         | subscription level ('free' or 'paid')
location      | user location
method        | HTTP request method (e.g., GET, PUT)
page          | page name, such as 'NextSong', 'Home', 'Logout', 'Settings', 'Downgrade', 'Login', 'Help', 'Error', 'Upgrade'
registration  | registration timestamp
sessionId     | session id
song          | song title
status        | HTTP status code of the request (e.g., 200, 307, 404)
ts            | timestamp of the event, in milliseconds
userAgent     | user agent string of the user's browser/OS
userId        | user id
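
For example, a log file can be loaded the same way and filtered down to song-play events. Again a minimal sketch; the file path is hypothetical:

    import pandas as pd

    # Hypothetical path to one log file under ./data/log_data/
    log_file = "./data/log_data/2018/11/2018-11-01-events.json"

    # Log files are newline-delimited JSON, one event per line
    df = pd.read_json(log_file, lines=True)

    # Keep only song-play events and convert the millisecond timestamp
    df = df[df["page"] == "NextSong"].copy()
    df["start_time"] = pd.to_datetime(df["ts"], unit="ms")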

ERD Model

You will use a star schema as the data model for this ETL pipeline, with a fact table and supporting dimension tables. An entity relationship diagram is shown below.

(Entity relationship diagram: image1.jpeg)
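
As a rough sketch of what a fact-table definition in this schema might look like (the column names and types below are assumptions inferred from the datasets above, not the project's exact DDL):

    # Hypothetical DDL for the songplays fact table, as it might appear
    # in create_tables.py; adjust names and types to the actual schema.
    songplay_table_create = """
        CREATE TABLE IF NOT EXISTS songplays (
            songplay_id SERIAL PRIMARY KEY,
            start_time  TIMESTAMP NOT NULL,
            user_id     INT NOT NULL,
            level       VARCHAR,
            song_id     VARCHAR,
            artist_id   VARCHAR,
            session_id  INT,
            location    VARCHAR,
            user_agent  VARCHAR
        );
    """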

Getting Started

Clone this repository

git clone https://github.com/najuzilu/DM-PostgreSQL.git

Prerequisites

  • conda
  • python 3.8
  • psycopg2
  • pandas
  • ipykernel

Create a virtual environment through Anaconda using

conda env create --file environment.yml

To install the IPython kernel, run

python -m ipykernel install --user --name dm-postgre-sql

Project Steps

  1. Run create_tables.py to create the tables
    python create_tables.py
  2. Run etl.py to execute the ETL pipeline and load the data into PostgreSQL
    python etl.py
    Alternatively, you can open and run etl.ipynb for an easy-to-use, interactive environment. A rough sketch of the kind of insert the pipeline performs is shown after this list.
  3. Open and run test.ipynb to make sure the tables have been populated.
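
The snippet below is only an illustration of that kind of insert, not the project's actual code; the connection parameters, table name, and columns are assumptions:

    import psycopg2

    # Connection parameters are assumptions; adjust to your local setup
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()

    # Insert one hypothetical user record into a dimension table
    cur.execute(
        """
        INSERT INTO users (user_id, first_name, last_name, gender, level)
        VALUES (%s, %s, %s, %s, %s)
        ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level;
        """,
        (91, "Jayden", "Bell", "M", "free"),
    )

    conn.commit()
    cur.close()
    conn.close()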

Authors

Yuna Luzi - @najuzilu

License

Distributed under the MIT License. See LICENSE for more information.