In this project, we create a database schema in Postgres and build an ETL pipeline using Python.
A startup collects song and user-activity data from its music streaming app. The analytics team wants to know which songs users are listening to. The data resides in a directory of JSON logs of user activity and a directory of JSON metadata on the songs in the app.
The song dataset (in the data folder) is a subset of real data from the Million Song Dataset.
- Set environment variables: PGHOSTADDR to the IP address of the Postgres host, and PGPASSWORD to the database password (see the sketch below).
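Assuming the scripts connect with psycopg2 (the usual driver for a Python Postgres ETL), libpq picks both variables up automatically, so neither value needs to appear in the code. A minimal sketch of a connection that relies on them:

```python
import psycopg2

# libpq (used by psycopg2) reads PGHOSTADDR and PGPASSWORD from the
# environment, so neither value needs to appear in the connection call.
# The user name is a placeholder -- use the role that owns sparkifydb.
conn = psycopg2.connect(dbname="sparkifydb", user="postgres")
conn.close()
```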
- Clone the repo:
git clone https://github.com/mhaywardhill/Song-Play-Analysis.git
- Set up the Python environment:
conda create -n sparkify python=3.6
conda activate sparkify
pip install -r requirements.txt
- Create the sparkifydb database and its schema:
python ./create_tables.py
- Run the ETL:
python ./etl.py
The repo contains the following files:

sql_queries.py: Contains the SQL statements that drop and create the fact and dimension tables, as well as the queries that load the dimension tables.
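For a feel of what the file holds, here is a hypothetical drop/create/insert trio in the same style; the real table names and columns are defined in the repo:

```python
# Hypothetical statements in the style sql_queries.py uses; the actual
# table definitions live in the repo.
song_table_drop = "DROP TABLE IF EXISTS songs"

song_table_create = """
CREATE TABLE IF NOT EXISTS songs (
    song_id   VARCHAR PRIMARY KEY,
    title     VARCHAR NOT NULL,
    artist_id VARCHAR NOT NULL,
    year      INT,
    duration  FLOAT
)
"""

song_table_insert = """
INSERT INTO songs (song_id, title, artist_id, year, duration)
VALUES (%s, %s, %s, %s, %s)
ON CONFLICT (song_id) DO NOTHING
"""
```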
create_tables.py: Creates the sparkifydb database and the database schema (the tables).
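The usual shape of such a script, sketched here assuming sql_queries.py exposes the statements above (the imported names are illustrative):

```python
import psycopg2

from sql_queries import song_table_create, song_table_drop  # illustrative names


def main():
    # Connect to a maintenance database first so sparkifydb can be (re)created;
    # CREATE DATABASE cannot run inside a transaction, hence autocommit.
    conn = psycopg2.connect(dbname="postgres", user="postgres")  # placeholder role
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute("DROP DATABASE IF EXISTS sparkifydb")
    cur.execute("CREATE DATABASE sparkifydb")
    conn.close()

    # Reconnect to sparkifydb and build the schema.
    conn = psycopg2.connect(dbname="sparkifydb", user="postgres")
    cur = conn.cursor()
    cur.execute(song_table_drop)
    cur.execute(song_table_create)
    conn.commit()
    conn.close()


if __name__ == "__main__":
    main()
```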
etl.py: Runs the full ETL pipeline.
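A condensed sketch of the extract-and-load loop such a script runs; the data/song_data layout and the column names are assumptions based on the song metadata described above:

```python
import glob
import os

import pandas as pd
import psycopg2

from sql_queries import song_table_insert  # illustrative name


def process_song_files(cur, conn, data_dir=os.path.join("data", "song_data")):
    # Walk the directory tree (assumed layout) and load every JSON song file.
    for path in glob.glob(os.path.join(data_dir, "**", "*.json"), recursive=True):
        df = pd.read_json(path, lines=True)  # one JSON object per line
        for _, row in df.iterrows():
            cur.execute(
                song_table_insert,
                (row.song_id, row.title, row.artist_id,
                 int(row.year), float(row.duration)),
            )
        conn.commit()


if __name__ == "__main__":
    conn = psycopg2.connect(dbname="sparkifydb", user="postgres")  # placeholder role
    process_song_files(conn.cursor(), conn)
    conn.close()
```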
Manual_ETL.ipynb: Jupyter notebook for running the ETL manually, step by step.