etl-sparkify-postgres

This repository is based on Project #1 from the Data Engineering course on Udacity.


PURPOSE

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.

They'd like to create a Postgres database with tables designed to optimize queries on song play analysis, which is what this repository provides.

OBJECTIVES

There are two main goals for this repo.

  1. Define fact and dimension tables for a star schema.
  2. Establish an ETL pipeline that populates the schema above and supports the analytics team's analysis. The pipeline transfers data from files in two local directories into these tables in Postgres using Python and SQL.

DATA SOURCE DESCRIPTION

There are two local directories containing two different datasets, as follows:

  1. Song Dataset

This dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.

{
"num_songs": 1, 
"artist_id": "ARJIE2Y1187B994AB7", 
"artist_latitude": null, 
"artist_longitude": null, 
"artist_location": "", 
"artist_name": "Line Renaud", 
"song_id": "SOUPIRU12A6D4FA1E1", 
"title": "Der Kleine Dompfaff", 
"duration": 152.92036, 
"year": 0
}
  2. Log Dataset

This dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These files simulate activity logs from a music streaming app based on specified configurations.

The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset.

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json

You can use the pandas.read_json(filepath, lines=True) method to load these files and inspect their contents.
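For a quick look at the raw data, both file types can be loaded with pandas. The snippet below is only an illustrative sketch: it uses the example filepaths listed above and assumes each song file holds a single JSON record on one line.

import pandas as pd

# A song file contains one JSON record; lines=True parses it into a one-row DataFrame.
song_df = pd.read_json("song_data/A/A/B/TRAABJL12903CDCF1A.json", lines=True)
print(song_df[["song_id", "title", "artist_id", "year", "duration"]])

# A log file contains one JSON record per line, one per logged event.
log_df = pd.read_json("log_data/2018/11/2018-11-12-events.json", lines=True)
# Only 'NextSong' events correspond to actual song plays (see the songplays table below).
print(log_df[log_df["page"] == "NextSong"].head())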

DATABASE SCHEMA DESIGN

The repo uses a star schema optimized for queries on song play analysis. It consists of FIVE tables: ONE fact table and FOUR dimension tables. These tables ingest the corresponding information from the JSON files described above. The following describes the columns of each table; a sketch of the fact table's DDL follows the listings.

- Fact Table

songplays
This table holds the records in the log data associated with song plays, i.e. records with page set to NextSong. It has columns as follows,

    • songplay_id
    • start_time
    • user_id
    • level
    • song_id
    • artist_id
    • session_id
    • location
    • user_agent

- Dimension Tables

  1. users
    The users table records information about the users of the app. It has columns as follows,

    • user_id
    • first_name
    • last_name
    • gender
    • level
  2. songs
    The songs table records basic information about the songs in the music database. It has columns as follows,

    • song_id
    • title
    • artist_id
    • year
    • duration
  3. artists
    The artists table records information about the artists in the music database. It has columns as follows,

    • artist_id
    • name
    • location
    • latitude
    • longitude
  4. time
    The time table records the timestamps of records in songplays, broken down into specific units. It has columns as follows,

    • start_time
    • hour
    • day
    • week
    • month
    • year
    • weekday
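To make the schema concrete, below is a sketch of how the songplays fact table could be declared. The actual DDL lives in sql_queries.py and may differ; the column names follow the schema above, while the data types and constraints are assumptions.

# Hypothetical sketch of the songplays DDL; types and constraints are assumptions.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""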

REPOSITORY PROFILE

In addition to this README.md and the data files, the repository includes five other files:

  1. test.ipynb displays the first few rows of each table to let you check the database.

  2. create_tables.py drops and creates the tables. Running this file resets the tables each time before you run the ETL scripts (see below).

  3. etl.ipynb reads and processes a single file from song_data and a single file from log_data and loads the data into the corresponding tables in the database. This notebook also contains detailed steps for building the ETL process for each of the tables. You can regard this notebook as a preparation or experimentation step on a single file before reading and processing all the files in batch with etl.py.

  4. etl.py reads and processes all the files from song_data and log_data and loads them into the five tables. Its logic is based on the etl.ipynb notebook.

  5. sql_queries.py contains all the SQL queries (including those that drop and create the tables) and is imported by the last three files above.
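For orientation, a query module like sql_queries.py usually exposes each statement as a plain string plus lists that the other scripts can loop over. The sketch below is an assumption about its shape; the variable names (drop_table_queries, songplay_table_insert, etc.) are illustrative, not necessarily the ones used here.

# sql_queries.py (sketch): every statement is a plain SQL string.
songplay_table_drop = "DROP TABLE IF EXISTS songplays;"
user_table_drop = "DROP TABLE IF EXISTS users;"

# Parameterized insert used by the ETL code (placeholders filled in by psycopg2).
songplay_table_insert = """
INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                       session_id, location, user_agent)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s);
"""

# Grouped into lists so create_tables.py can simply iterate over them.
drop_table_queries = [songplay_table_drop, user_table_drop]  # ...plus the remaining drops
create_table_queries = []  # the CREATE statements, e.g. the songplays DDL sketched earlier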

HOW TO START

Below are steps you can follow to walk through this repository:

STEP ONE: Creating Databases and Tables

  1. Run create_tables.py to create the database and tables (a sketch of this script's flow follows this list).
  2. Run test.ipynb to confirm the creation of the database and tables with the corresponding columns based on the schema. Click "Restart kernel" to close the connection before going on with the ETL process.
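For reference, the flow inside create_tables.py is roughly the following sketch. The connection strings, credentials, database name, and the query-list names imported from sql_queries.py are all assumptions for illustration.

import psycopg2

# Assumed to be exposed by sql_queries.py as lists of SQL strings (see the sketch above).
from sql_queries import create_table_queries, drop_table_queries

def main():
    # Connect to a default database first so the project database can be (re)created.
    conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
    conn.set_session(autocommit=True)
    cur = conn.cursor()
    cur.execute("DROP DATABASE IF EXISTS sparkifydb")
    cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8' TEMPLATE template0")
    conn.close()

    # Reconnect to the fresh database and rebuild all five tables.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    main()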

STEP TWO: Initiate ETL Processes

Instructions in the etl.ipynb notebook guide you through how the ETL process is developed for each table. At the end of each table section, or at the end of the notebook, you can run test.ipynb to confirm that records were successfully inserted into each table. Be sure to rerun create_tables.py to reset your tables each time before you run this notebook.
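As one example of the per-table logic developed in the notebook, the time table can be derived from the log data's millisecond timestamps roughly as follows. This is only a sketch: the timestamp field is assumed to be named ts, and the filepath is the example file shown earlier.

import pandas as pd

# Keep only song-play events and convert the millisecond epoch to proper timestamps.
df = pd.read_json("log_data/2018/11/2018-11-12-events.json", lines=True)
df = df[df["page"] == "NextSong"]
t = pd.to_datetime(df["ts"], unit="ms")

# Break each timestamp into the units stored in the time table.
time_df = pd.DataFrame({
    "start_time": t,
    "hour": t.dt.hour,
    "day": t.dt.day,
    "week": t.dt.isocalendar().week,
    "month": t.dt.month,
    "year": t.dt.year,
    "weekday": t.dt.weekday,
})
print(time_df.head())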

STEP THREE: Building ETL Pipeline

  1. Run the etl.py file to read and process the entire dataset -- this is the PIPELINE itself. The logic behind it is based on the etl.ipynb notebook. (Be sure to run create_tables.py before running etl.py to reset (or initialize) the tables.) A sketch of the driver loop inside etl.py follows this list.
  2. Run test.ipynb to confirm the records were successfully inserted into each table.
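Conceptually, etl.py walks each data directory, applies a table-specific processing function to every JSON file it finds, and commits the inserts. The sketch below illustrates that driver loop; the function and argument names are assumptions, not necessarily those used in etl.py.

import glob
import os

import psycopg2

def process_data(cur, conn, filepath, func):
    # Collect every JSON file underneath the given directory...
    all_files = glob.glob(os.path.join(filepath, "**", "*.json"), recursive=True)
    # ...and run the table-specific processing function on each one.
    for i, datafile in enumerate(all_files, 1):
        func(cur, datafile)
        conn.commit()
        print(f"{i}/{len(all_files)} files processed")

# Hypothetical usage, assuming process_song_file and process_log_file were
# developed in etl.ipynb:
# conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
# cur = conn.cursor()
# process_data(cur, conn, filepath="song_data", func=process_song_file)
# process_data(cur, conn, filepath="log_data", func=process_log_file)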