UND-PostGres-DM

**Udacity Nanodegree Postgres Data Modelling Project**

Project description:

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity from their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app. To achieve this objective we will create a database schema and an ETL pipeline that populates the schema with the data needed for this analysis.

Database design:

The database schema consists of the following fact table:

SONGPLAYS - stores records from the log data associated with song plays (i.e. records with page NextSong), with the following column definitions:

           songplay_id serial PRIMARY KEY
           start_time timestamp NOT NULL REFERENCES time(start_time)
           user_id text NOT NULL REFERENCES users(user_id)
           level varchar
           song_id text REFERENCES songs(song_id)
           artist_id text REFERENCES artists(artist_id)
           session_id text
           location varchar
           user_agent varchar
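
As an illustration, the DDL for this fact table might be kept in sql_queries.py as a plain SQL string along the following lines (a sketch based on the column list above; the exact statement in the repository may differ):

    # Sketch of the songplays DDL as a Python string in sql_queries.py
    # (column definitions taken from the list above).
    songplay_table_create = """
        CREATE TABLE IF NOT EXISTS songplays (
            songplay_id serial PRIMARY KEY,
            start_time  timestamp NOT NULL REFERENCES time(start_time),
            user_id     text NOT NULL REFERENCES users(user_id),
            level       varchar,
            song_id     text REFERENCES songs(song_id),
            artist_id   text REFERENCES artists(artist_id),
            session_id  text,
            location    varchar,
            user_agent  varchar
        );
    """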

The database schema consists of the following dimension tables:

USERS - stores the users of the app, with the following column definitions:

           user_id text PRIMARY KEY
           first_name varchar
           last_name varchar
           gender varchar
           level varchar 

SONGS - stores the songs in the music database, with the following column definitions:

           song_id text PRIMARY KEY
           title varchar
           artist_id text
           year int
           duration numeric

ARTISTS - stores the artists in the music database, with the following column definitions:

           artist_id text PRIMARY KEY
           name varchar NOT NULL
           location varchar
           latitude float
           longitude float

TIME - stores the timestamps of the records in songplays, broken down into specific time units (hour, day, week, month, year, weekday), with the following column definitions:

           start_time timestamp PRIMARY KEY
           hour int
           day int
           week int
           month int
           year int
           weekday varchar
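
For comparison, the TIME dimension table could be defined with a statement like the following sketch (again derived from the column list above, not necessarily the exact DDL in sql_queries.py):

    # Sketch of the time-table DDL, following the same sql_queries.py convention.
    time_table_create = """
        CREATE TABLE IF NOT EXISTS time (
            start_time timestamp PRIMARY KEY,
            hour       int,
            day        int,
            week       int,
            month      int,
            year       int,
            weekday    varchar
        );
    """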

The following ER diagram shows the star-schema relationships between the fact and dimension tables.

ER-Diagram

ETL Process: This section describes how the log and song files are processed and loaded into the tables so that analytical queries can be run on them. It also describes which directories hold which kind of data and how the data is extracted and transformed.

The following two datasets, located under the directory structure shown below, are loaded into the database.

For example:

song_data/A/B/C/TRABCEI128F424C983.json

The song dataset (under song_data) is a subset of real data from the Million Song Dataset (https://labrosa.ee.columbia.edu/millionsong/). Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.

log_data/2018/11/2018-11-12-events.json

The log dataset (under log_data) consists of log files in JSON format generated by the eventsim event simulator (https://github.com/Interana/eventsim), based on the songs in the dataset above. These files simulate activity logs from a music streaming app according to specified configurations.

The ETL program consists of three processing functions and a main function that drives them, as follows:

1. process_song_file

The song data files are read into a pandas dataframe one by one. The song and artist data are then extracted from the dataframe and inserted into the songs and artists tables, file by file.
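
A minimal sketch of what this function could look like, assuming psycopg2 supplies the database cursor and that the insert statements live in sql_queries.py under names such as song_table_insert and artist_table_insert (these names and the JSON field names are assumptions, not necessarily what etl.py uses):

    import pandas as pd
    from sql_queries import song_table_insert, artist_table_insert  # assumed names

    def process_song_file(cur, filepath):
        """Read one song JSON file and insert its song and artist records."""
        df = pd.read_json(filepath, lines=True)  # each file holds a single JSON record

        # Song record -> songs table (field names assumed from the song dataset).
        song_data = df[['song_id', 'title', 'artist_id', 'year', 'duration']].values[0].tolist()
        cur.execute(song_table_insert, song_data)

        # Artist record -> artists table.
        artist_data = df[['artist_id', 'artist_name', 'artist_location',
                          'artist_latitude', 'artist_longitude']].values[0].tolist()
        cur.execute(artist_table_insert, artist_data)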

2. process_log_file

The log files are read into a pandas dataframe one by one. The timestamp is then split into its time elements (hour, day, week, month, year, weekday). User and time data are loaded into the USERS and TIME tables through the dataframe record-iteration loop.

The songplays records are then populated by joining the songs and artists tables with a SQL query to look up the matching song_id and artist_id for each log record.
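
A condensed sketch of this function, under the same assumptions about psycopg2 and the query names in sql_queries.py (time_table_insert, user_table_insert, song_select and songplay_table_insert are assumed names, and the log JSON field names are likewise assumptions):

    from sql_queries import (time_table_insert, user_table_insert,
                             song_select, songplay_table_insert)  # assumed names

    def process_log_file(cur, filepath):
        """Read one log JSON file and load time, user and songplay records."""
        df = pd.read_json(filepath, lines=True)
        df = df[df['page'] == 'NextSong']          # keep only song-play events

        # Split the millisecond timestamp into the time-table columns.
        t = pd.to_datetime(df['ts'], unit='ms')
        for ts, h, d, w, m, y, wd in zip(t, t.dt.hour, t.dt.day,
                                         t.dt.isocalendar().week, t.dt.month,
                                         t.dt.year, t.dt.day_name()):
            cur.execute(time_table_insert,
                        (ts.to_pydatetime(), int(h), int(d), int(w), int(m), int(y), wd))

        # Load the user records.
        for _, row in df[['userId', 'firstName', 'lastName', 'gender', 'level']].iterrows():
            cur.execute(user_table_insert, list(row))

        # Populate songplays: join songs and artists via song_select to look up
        # the matching song_id and artist_id for each log record.
        for _, row in df.iterrows():
            cur.execute(song_select, (row.song, row.artist, row.length))
            match = cur.fetchone()
            song_id, artist_id = match if match else (None, None)
            cur.execute(songplay_table_insert,
                        (pd.to_datetime(row.ts, unit='ms'), str(row.userId), row.level,
                         song_id, artist_id, str(row.sessionId), row.location, row.userAgent))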

3. process_data

Reads the data files under the path passed to the function and triggers the processing function passed in as a parameter. After collecting either the song files or the log files, the specific function handed in (process_song_file or process_log_file) is executed on each file.
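
A sketch of how this driver function might walk the directory tree and dispatch each file to the function passed in (the exact behaviour is an assumption based on the description above):

    import glob
    import os

    def process_data(cur, conn, filepath, func):
        """Find every JSON file under `filepath` and hand it to `func`."""
        all_files = []
        for root, _dirs, _files in os.walk(filepath):
            all_files.extend(glob.glob(os.path.join(root, '*.json')))

        # func is either process_song_file or process_log_file.
        for i, datafile in enumerate(all_files, start=1):
            func(cur, datafile)
            conn.commit()
            print(f'{i}/{len(all_files)} files processed in {filepath}')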

4. function main

Manages the connection to the database, calls the process_data function twice (once for the song file data and once for the log file data), and finally closes the database connection.
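
Putting it together, main might look roughly like this (the connection string, the database name sparkifydb and the data/ directory prefix are assumptions; adjust them to your environment):

    import psycopg2

    def main():
        # Connection parameters are assumptions; adjust to your local setup.
        conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
        cur = conn.cursor()

        # Song metadata first, so the dimension tables are filled before
        # songplays references them; then the user-activity logs.
        process_data(cur, conn, filepath='data/song_data', func=process_song_file)
        process_data(cur, conn, filepath='data/log_data', func=process_log_file)

        conn.close()

    if __name__ == "__main__":
        main()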

Project Repository files:

1. create_tables.py - contains the function calls that create the database and the tables, using the definitions in sql_queries.py.
2. sql_queries.py - contains the actual DDL and query definitions.
3. etl.py - contains the Python code that executes the data pipeline.

How To Run the Project:

The following Python scripts need to be run in order from a terminal emulator:

  1. create_tables.py ($ python create_tables.py)
  2. etl.py ($ python etl.py)

create_tables.py is the master program: it contains the function main and the program logic to create the database and tables using the DDL and query definitions in the sql_queries.py file.
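
As a rough sketch of that logic (assuming sql_queries.py exposes lists named drop_table_queries and create_table_queries; the real file may organise this differently, and creating the database itself is omitted for brevity):

    import psycopg2
    from sql_queries import create_table_queries, drop_table_queries  # assumed names

    def main():
        """Drop any existing tables and recreate them from the DDL in sql_queries.py."""
        conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
        cur = conn.cursor()

        for query in drop_table_queries + create_table_queries:
            cur.execute(query)
            conn.commit()

        conn.close()

    if __name__ == "__main__":
        main()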

The etl.py file contains the program logic and Python code to execute the data pipeline that transfers the data in the JSON files into the star schema created earlier.

Analytical SQL queries can then be created and tested in the notebook test.ipynb to answer the questions of interest.
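
For example, a hypothetical analytical query (the ten most-played songs) could be tested from the notebook with pandas, along these lines (connection parameters are assumptions):

    import pandas as pd
    import psycopg2

    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")

    # Example only: which songs are played most often?
    top_songs = pd.read_sql("""
        SELECT s.title, a.name AS artist, COUNT(*) AS plays
        FROM songplays sp
        JOIN songs s   ON sp.song_id = s.song_id
        JOIN artists a ON sp.artist_id = a.artist_id
        GROUP BY s.title, a.name
        ORDER BY plays DESC
        LIMIT 10;
    """, conn)

    print(top_songs)
    conn.close()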