ETL process for Creating Sparkify Data Model in Postgres

This is my first project for the Udacity Nanodegree of Data Engineering. It is about an etl process (postgres based) for Sparkify.

Sparkify is a simulated (not-real) online music streaming service.

This Git repository shows how to script an etl process for loading data from json raw data to a Postgre SQL Database and for creating fact and dimension tables in that manner.

This is done using Python, mainly with pandas and psycopg2

Purpose of the Database sparkifydb

The sparkifydb database is postgre SQL based and is about storing information about songs and listening behaviour of the users

The analytical goal of this database to get all kings of insight into the user beahviour

Description of the ETL Pipeline

Description of the raw Datasets

Raw data comes in json formats and is stored in several subdirectories und the /data directory

log data

This directory contains jsons which show basically user activity per day on Sparkify.

song data

This directory contains jsons which show basically available songs and artists on Sparkify.

Scripts and Files

Basically the shell script RunScripts.sh contains the relevant etl files. So basically running this script (./RunScripts.sh) resets the sparkify database to an empty state and then creates all the table structures (create_tables.py). After that all the necessary data is derived from the json files under the /data directory and loaded into the tables created with the etl.py. The functions from sql_queries.py are used in both of these scripts for dropping, creating and inserting in postgre tables.

Other Files

test.ipynb

A notebook for testing the contents of the sql tables.

Final Data Structure

Please find descriptions of the final tables below

songplays

This is supposed to be the fact table and shows every single songplay activity, i.e. a user listened to a speicifc song at a specific time and so on. It has an artifical primary key via identity column and all other sorts of attributes concerning the songplay activity.

users

This table contains master data on users

songs

This table contains master data on songs

artists

This table contains master data on artists

time

This table contains master data on the timestamp, i.e. what hour, day, month, etc.

analysis

This notebook shows some basic analysis.

Potential next steps and imporovements

1. Use Bulk method for loading data. This would be necessary with bigger amounts data

2. Create a way to make only increment loads

3. Create alerts or similar to monitor the etl pipeline