**Udacity Nanodegree Postgres Data Modelling Project**
Project description:
A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity from their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app. To achieve this objective, we create a database schema and an ETL pipeline that populates the schema with the data to support this analysis.
Database design:
The database schema consists of the following fact table:
SONGPLAYS - stores records from the log data associated with song plays, i.e. records with page NextSong, with the following column definitions:
songplay_id serial PRIMARY KEY
start_time timestamp NOT NULL REFERENCES time(start_time)
user_id text NOT NULL REFERENCES users(user_id)
level varchar
song_id text REFERENCES songs(song_id)
artist_id text REFERENCES artists(artist_id)
session_id text
location varchar
user_agent varchar
The schema also contains the following dimension tables:
USERS - stores the users of the app, with the following column definitions:
user_id text PRIMARY KEY
first_name varchar
last_name varchar
gender varchar
level varchar
SONGS - stores the songs in the music database, with the following column definitions:
song_id text PRIMARY KEY
title varchar
artist_id text
year int
duration numeric
ARTISTS - stores the artists in the music database, with the following column definitions:
artist_id text PRIMARY KEY
name varchar NOT NULL
location varchar
latitude float
longitude float
TIME - stores the timestamps of records in songplays, broken down into specific time units (hour, day, week, month, year, weekday), with the following column definitions:
start_time timestamp PRIMARY KEY
hour int
day int
week int
month int
year int
weekday varchar
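As a rough illustration, the DDL for two of these tables might look like the following, written as query strings in the style of sql_queries.py. The statement names and exact formatting here are assumptions based on the column definitions above, not the project's verbatim code.

```python
# Sketch of CREATE TABLE statements as they might appear in sql_queries.py.
# Names (songplay_table_create, user_table_create) are assumed conventions.

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id serial PRIMARY KEY,
    start_time  timestamp NOT NULL REFERENCES time(start_time),
    user_id     text NOT NULL REFERENCES users(user_id),
    level       varchar,
    song_id     text REFERENCES songs(song_id),
    artist_id   text REFERENCES artists(artist_id),
    session_id  text,
    location    varchar,
    user_agent  varchar
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    text PRIMARY KEY,
    first_name varchar,
    last_name  varchar,
    gender     varchar,
    level      varchar
);
"""
```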
The following ERD diagram illustrates the star-schema relationships between the fact and dimension tables.
ETL Process: This section describes how the log and song files are processed into the different tables so that analytical queries can be run on them. It also describes which directories hold which kind of data and how the data is extracted and transformed.
The following two datasets, organised in the directory structures shown below, are loaded into the database.
Song data, for example:
song_data/A/B/C/TRABCEI128F424C983.json
The song dataset ("SONGS DATA") is a subset of real data from the Million Song Dataset (https://labrosa.ee.columbia.edu/millionsong/). Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.
Log data, for example:
log_data/2018/11/2018-11-12-events.json
The log dataset consists of log files in JSON format generated by this event simulator (https://github.com/Interana/eventsim) based on the songs in the dataset above. These simulate activity logs from a music streaming app according to specified configurations.
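As a minimal sketch of how these files can be read, assuming (as in the standard project template) that each song file holds a single JSON object and each log file is newline-delimited JSON:

```python
import pandas as pd

# Each song file contains a single JSON object describing one song and its artist.
song_df = pd.read_json("song_data/A/B/C/TRABCEI128F424C983.json", lines=True)

# Each log file contains one JSON event per line.
log_df = pd.read_json("log_data/2018/11/2018-11-12-events.json", lines=True)

# Only "NextSong" events correspond to actual song plays.
log_df = log_df[log_df["page"] == "NextSong"]
```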
The ETL program consists of three functions plus a main function that triggers them in order, as follows (a condensed code sketch follows the list):
1. process_song_file
Each song data file is read into a pandas dataframe, one file at a time. The artist and song attributes are then extracted from the dataframe and inserted into the artists and songs tables, file by file.
2. process_log_file
Each log file is read into a pandas dataframe, one file at a time. The timestamp is split into its time units (hour, day, week, month, year, weekday), and the user and time data are loaded into the USERS and TIME tables while iterating over the dataframe records.
The song_id and artist_id for each record are looked up by joining the songs and artists tables with an SQL join, and the resulting songplay records are inserted into the SONGPLAYS table.
3. process_data
Walks the path passed to the function, collects the data files found there, and triggers the function passed in as a parameter: depending on whether song files or log files were gathered, either process_song_file or process_log_file is executed on each file.
4. main
Manages the connection to the database, calls the process_data function above twice (once to process the song file data and once to process the log file data), and then closes the database connection.
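The sketch below shows how these four pieces might fit together. It is a simplified illustration rather than the project's verbatim code: the insert statements and the song_select lookup are assumed to be defined in sql_queries.py, and the connection string follows the usual project template defaults.

```python
import glob
import os

import pandas as pd
import psycopg2
from sql_queries import *  # assumed to provide the INSERT statements and song_select


def process_song_file(cur, filepath):
    """Read one song file and insert its song and artist records."""
    df = pd.read_json(filepath, lines=True)

    song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
    cur.execute(song_table_insert, song_data)

    artist_data = df[["artist_id", "artist_name", "artist_location",
                      "artist_latitude", "artist_longitude"]].values[0].tolist()
    cur.execute(artist_table_insert, artist_data)


def process_log_file(cur, filepath):
    """Read one log file and insert time, user and songplay records."""
    df = pd.read_json(filepath, lines=True)
    df = df[df["page"] == "NextSong"]

    # Split the millisecond timestamp into its time units for the TIME table.
    t = pd.to_datetime(df["ts"], unit="ms")
    for row in zip(t, t.dt.hour, t.dt.day, t.dt.isocalendar().week,
                   t.dt.month, t.dt.year, t.dt.day_name()):
        cur.execute(time_table_insert, list(row))

    # Load the user attributes into the USERS table.
    for _, row in df[["userId", "firstName", "lastName", "gender", "level"]].iterrows():
        cur.execute(user_table_insert, list(row))

    # Look up song_id and artist_id via the songs/artists join, then insert songplays.
    for _, row in df.iterrows():
        cur.execute(song_select, (row.song, row.artist, row.length))
        result = cur.fetchone()
        song_id, artist_id = result if result else (None, None)
        cur.execute(songplay_table_insert,
                    (pd.to_datetime(row.ts, unit="ms"), row.userId, row.level,
                     song_id, artist_id, row.sessionId, row.location, row.userAgent))


def process_data(cur, conn, filepath, func):
    """Walk the directory tree under filepath and apply func to every JSON file."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))
    for datafile in all_files:
        func(cur, datafile)
        conn.commit()


def main():
    # Connection string follows the usual project template defaults (an assumption).
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()
    process_data(cur, conn, filepath="data/song_data", func=process_song_file)
    process_data(cur, conn, filepath="data/log_data", func=process_log_file)
    conn.close()


if __name__ == "__main__":
    main()
```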
Project Repository files:
1. create_tables.py - contains the function calls that create the database schema and the tables, using the definitions in the python file sql_queries.py.
2. sql_queries.py - contains the actual table definition and query statements.
3. etl.py - contains the Python code to execute the data pipeline.
How To Run the Project:
The following Python scripts need to be run, in order, from a terminal emulator:
- create_tables.py ($ python create_tables.py)
- etl.py ($ python etl.py)
create_tables.py is the master program: it contains the main function and the program logic that creates the database and tables using the DDL and query definitions in the sql_queries.py file.
The etl.py file contains the program logic and Python code to execute the data pipeline that transfers the data in the JSON files into the star schema created earlier.
SQL queries for the analytical questions that need to be answered can then be created and tested in the file test.ipynb.
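For instance, a hypothetical analytical query (not part of the project files) to find the ten most played songs could be tested like this:

```python
import psycopg2

# Connection parameters follow the usual project template defaults (an assumption).
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Hypothetical example: the ten most frequently played songs with their artists.
cur.execute("""
    SELECT s.title, a.name AS artist, COUNT(*) AS plays
    FROM songplays sp
    JOIN songs   s ON sp.song_id   = s.song_id
    JOIN artists a ON sp.artist_id = a.artist_id
    GROUP BY s.title, a.name
    ORDER BY plays DESC
    LIMIT 10;
""")
for title, artist, plays in cur.fetchall():
    print(title, artist, plays)

conn.close()
```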