Sparkify

Data Modeling with Postgres

Project Overview

Motivation

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs of user activity on the app and a directory of JSON metadata on the songs in the app.

This project aims to show Python's capabilities for data modeling and ETL pipeline design using Postgres.

Mainly, this project focuses on:

  • Defining fact and dimension tables under a star schema optimized for OLAP queries.
  • Designing an ETL pipeline that transfers data from two local directories into a Postgres database.

How to Run Scripts

Prerequisites

  1. PyEnv
  2. Poetry

PyEnv installation on macOS

In the terminal:

  1. Update Homebrew and install PyEnv:

    brew update
    brew install pyenv
  2. Set up the shell environment for PyEnv in Zsh:

    echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.zshrc
    echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.zshrc
    echo 'eval "$(pyenv init -)"' >> ~/.zshrc
  3. Install Python 3.8.6 using PyEnv and set it as the global version:

    pyenv install 3.8.6
    pyenv global 3.8.6

Poetry installation

  1. Install Poetry using the following command:

    curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python -
  2. Add Poetry's bin directory to the PATH environment variable:

    source $HOME/.poetry/env
  3. Configure Poetry to create the virtual environment in the project root:

    poetry config virtualenvs.in-project true

Dependencies Installation

  1. Create the Python environment with all dependencies:

    poetry install --no-root
  2. Activate the Python environment:

    source .venv/bin/activate

Scripts

The project is structured as follows:

  • sql_queries.py: A script containing all the hard-coded SQL queries; future versions will refactor this using an ORM.

  • create_tables.py: Drops and recreates the database schema shown in the Database Schema section (a minimal sketch of this pattern follows this list); future versions will handle migrations with Alembic.

  • etl.py: Extracts, processes, and loads the data from the ./data directory into the Sparkify database.
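
For orientation, create_tables.py follows the usual drop-and-recreate pattern. Below is a minimal sketch, assuming sql_queries.py exposes two lists of SQL strings with the hypothetical names drop_table_queries and create_table_queries, and that Postgres runs locally with placeholder credentials; it is not the script's actual contents:

    import psycopg2

    # Hypothetical names; sql_queries.py is assumed to expose two lists of SQL strings.
    from sql_queries import drop_table_queries, create_table_queries


    def main():
        # Connection parameters are placeholders; adjust them to your local setup.
        conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
        conn.autocommit = True
        cur = conn.cursor()

        # Drop any existing tables, then build the schema from scratch.
        for query in drop_table_queries + create_table_queries:
            cur.execute(query)

        conn.close()


    if __name__ == "__main__":
        main()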

Hence, to process the JSON files, run the following commands in the terminal, in this order:

    python create_tables.py
    python etl.py

Data

The available data consists of:

  • Song data: A subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.

  • Log Dataset: Log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate user activity logs from a music streaming app based on specified configurations (a parsing sketch follows below).
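
To illustrate how etl.py can parse these logs, here is a minimal sketch using pandas. The file path is hypothetical, and it assumes the logs are newline-delimited JSON whose ts field holds a Unix timestamp in milliseconds; the breakdown units mirror a typical time dimension (the project's exact units aren't listed here):

    import pandas as pd

    # Hypothetical log file path; one JSON event per line.
    df = pd.read_json("data/log_data/2018/11/2018-11-01-events.json", lines=True)

    # Keep only actual song plays (the NextSong page events described below).
    df = df[df["page"] == "NextSong"]

    # Expand the millisecond timestamps into time-dimension units.
    t = pd.to_datetime(df["ts"], unit="ms")
    time_df = pd.DataFrame({
        "start_time": t,
        "hour": t.dt.hour,
        "day": t.dt.day,
        "week": t.dt.isocalendar().week,
        "month": t.dt.month,
        "year": t.dt.year,
        "weekday": t.dt.weekday,
    })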

A tree view of the ./data directory is shown in img.png.

Database Schema

We use a star schema optimized for queries that analyze which songs users are listening to, where we have:

  • Fact Table

    • songplays - records in log data associated with song plays, i.e. records with page NextSong.
  • Dimension Tables

    • users - users in the app: user_id, first_name, last_name, gender, level
    • songs - songs in the music database: song_id, title, artist_id, year, duration
    • artists - artists in the music database.
    • time - timestamps of records in songplays broken down into specific units.

The schema diagram is shown in img.png; a hypothetical DDL sketch follows.
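
To make the schema concrete, here is a hypothetical excerpt in the style of sql_queries.py for two of the tables, using the column lists given above; the Postgres column types and key choices are assumptions, not the project's actual definitions:

    # Hypothetical excerpt in the style of sql_queries.py; types and keys are assumptions.
    user_table_create = """
        CREATE TABLE IF NOT EXISTS users (
            user_id    INT PRIMARY KEY,
            first_name VARCHAR,
            last_name  VARCHAR,
            gender     CHAR(1),
            level      VARCHAR
        );
    """

    song_table_create = """
        CREATE TABLE IF NOT EXISTS songs (
            song_id   VARCHAR PRIMARY KEY,
            title     VARCHAR,
            artist_id VARCHAR,
            year      INT,
            duration  FLOAT
        );
    """

Under this layout, "what songs are users listening to" becomes a simple join of the fact table against the dimensions, e.g. SELECT s.title, COUNT(*) FROM songplays sp JOIN songs s ON sp.song_id = s.song_id GROUP BY s.title (column names again assumed).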

Contributing to Sparkify

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

New Features Guidelines

All new features must reside in their own branch, which must be pushed to the remote repository. However, instead of branching from the master branch, feature branches use the develop branch as their parent. When a feature is finished, it is merged back into the develop branch. Feature branches should never interact directly with the master branch.

Instructions

  1. Check out the develop branch:

    git checkout develop
  2. Pull the latest changes from the remote:

    git pull
  3. Double-check that all the dependencies from the current pyproject.toml are installed in your virtual env:

    poetry install
  4. Create your branch using the notation feat/{your-name}; note that {your-name} must be concise, for example:

    git checkout -b feat/pre-commit
  5. Push your newly created branch to the remote repository:

    git push --set-upstream origin feat/{your-name}
  6. Commit your changes regularly, using the following prefixes:
    • Updating existing code:

      git commit -m "[UPD] your message here"
    • New modules, classes, functions, etc.:

      git commit -m "[NEW] your message here"

Once you have the feature ready and tested, open a Pull Request to merge your branch into develop.

Bug Fixing Guidelines

Fix or hotfix branches are used to quickly patch production releases. They are very similar to feature branches, except that they use the master branch as their parent instead of develop. When the fix is done, it should be merged into both the master and develop branches, and master should be tagged with an updated version number.

Instructions

  1. Check out the master branch (hotfix branches use master, not develop, as their parent):

    git checkout master
  2. Pull the latest changes from the remote:

    git pull
  3. Double-check that all the dependencies from the current pyproject.toml are installed in your virtual env:

    poetry install
  4. Create your branch using the notation fix/{your-name}; note that {your-name} must be concise, for example:

    git checkout -b fix/logging-bug
  5. Push your newly created branch to the remote repository:

    git push --set-upstream origin fix/{your-name}
  6. Commit your changes regularly, using the following prefixes:
    • Updating existing code:

      git commit -m "[UPD] your message here"
    • New modules, classes, functions, etc.:

      git commit -m "[NEW] your message here"

Once you have the fix ready and tested, open a Pull Request to merge your branch into master, and remember to merge it back into develop as well.