Sparkify

Data Modeling with Postgres

Project Overview

Motivation

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs of user activity on the app and a directory of JSON metadata on the songs in the app.

This project aims to show Python's capabilities for data modeling and ETL pipeline design using Postgres.

Mainly, this project focuses on:

  • Defining fact and dimension tables under a star schema optimized for OLAP queries.
  • Designing an ETL pipeline that transfers data from two local directories into a Postgres database.

How to Run Scripts

Prerequisites

  1. PyEnv
  2. Poetry

PyEnv installation on macOS

In the terminal:

  1. Update Homebrew and install PyEnv:

    brew update
    brew install pyenv
  2. Set up the shell environment for PyEnv in Zsh:

    echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.zshrc
    echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.zshrc
    echo 'eval "$(pyenv init -)"' >> ~/.zshrc
  3. Install Python 3.8.6 using PyEnv and set it as the global version:

    pyenv install 3.8.6
    pyenv global 3.8.6

Poetry installation

  1. Install Poetry using the following command:

    curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python -
  2. Add Poetry's bin directory to the PATH environment variable:

    source $HOME/.poetry/env
  3. Configure Poetry to create the virtual environment in the project root:

    poetry config virtualenvs.in-project true

Dependencies Installation

  1. Create the Python environment with all dependencies:

    poetry install --no-root
  2. Activate the Python environment:

    source .venv/bin/activate

Scripts

The project is structured as follows:

  • sql_queries.py: A script containing all the hard-coded SQL queries; future versions will refactor this using an ORM.

  • create_tables.py: Drops and recreates the database schema shown in the Database Schema section (a minimal sketch of this pattern follows this list); future versions will handle migrations with Alembic.

  • etl.py: Extracts, processes, and loads the data from the ./data directory into the Sparkify database.
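
For orientation, create_tables.py follows the usual drop-and-recreate pattern. Below is a minimal sketch, assuming sql_queries.py exposes two lists of SQL strings with the hypothetical names drop_table_queries and create_table_queries, and that Postgres runs locally with placeholder credentials; it is not the script's actual contents:

    import psycopg2

    # Hypothetical names; sql_queries.py is assumed to expose two lists of SQL strings.
    from sql_queries import drop_table_queries, create_table_queries


    def main():
        # Connection parameters are placeholders; adjust them to your local setup.
        conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
        conn.autocommit = True
        cur = conn.cursor()

        # Drop any existing tables, then build the schema from scratch.
        for query in drop_table_queries + create_table_queries:
            cur.execute(query)

        conn.close()


    if __name__ == "__main__":
        main()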

Hence, to process the JSON files, run the following commands in the terminal, in this order:

    python create_tables.py
    python etl.py

Data

The available data consists of:

  • Song data: A subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.

  • Log Dataset: Log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate user activity logs from a music streaming app based on specified configurations (a parsing sketch follows below).
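
To illustrate how etl.py can parse these logs, here is a minimal sketch using pandas. The file path is hypothetical, and it assumes the logs are newline-delimited JSON whose ts field holds a Unix timestamp in milliseconds; the breakdown units mirror a typical time dimension (the project's exact units aren't listed here):

    import pandas as pd

    # Hypothetical log file path; one JSON event per line.
    df = pd.read_json("data/log_data/2018/11/2018-11-01-events.json", lines=True)

    # Keep only actual song plays (the NextSong page events described below).
    df = df[df["page"] == "NextSong"]

    # Expand the millisecond timestamps into time-dimension units.
    t = pd.to_datetime(df["ts"], unit="ms")
    time_df = pd.DataFrame({
        "start_time": t,
        "hour": t.dt.hour,
        "day": t.dt.day,
        "week": t.dt.isocalendar().week,
        "month": t.dt.month,
        "year": t.dt.year,
        "weekday": t.dt.weekday,
    })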

A tree view of the ./data directory is shown in img.png.

Database Schema

We use a star schema optimized for queries that analyze which songs users are listening to, where we have:

  • Fact Table

    • songplays - records in log data associated with song plays, i.e. records with page NextSong.
  • Dimension Tables

    • users - users in the app: user_id, first_name, last_name, gender, level
    • songs - songs in the music database: song_id, title, artist_id, year, duration
    • artists - artists in the music database.
    • time - timestamps of records in songplays broken down into specific units.

The schema diagram is shown in img.png; a hypothetical DDL sketch follows.
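
To make the schema concrete, here is a hypothetical excerpt in the style of sql_queries.py for two of the tables, using the column lists given above; the Postgres column types and key choices are assumptions, not the project's actual definitions:

    # Hypothetical excerpt in the style of sql_queries.py; types and keys are assumptions.
    user_table_create = """
        CREATE TABLE IF NOT EXISTS users (
            user_id    INT PRIMARY KEY,
            first_name VARCHAR,
            last_name  VARCHAR,
            gender     CHAR(1),
            level      VARCHAR
        );
    """

    song_table_create = """
        CREATE TABLE IF NOT EXISTS songs (
            song_id   VARCHAR PRIMARY KEY,
            title     VARCHAR,
            artist_id VARCHAR,
            year      INT,
            duration  FLOAT
        );
    """

Under this layout, "what songs are users listening to" becomes a simple join of the fact table against the dimensions, e.g. SELECT s.title, COUNT(*) FROM songplays sp JOIN songs s ON sp.song_id = s.song_id GROUP BY s.title (column names again assumed).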

Contributing to Sparkify

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

New Features Guidelines

All new features must reside in their own branch, which must be pushed to the remote repository. However, instead of branching from the master branch, feature branches use the develop branch as their parent. When a feature is finished, it is merged back into the develop branch. Feature branches should never interact directly with the master branch.

Instructions

  1. Check out the develop branch:

    git checkout develop
  2. Pull the latest changes from the remote:

    git pull
  3. Double-check that all the dependencies from the current pyproject.toml are installed in your virtual env:

    poetry install
  4. Create your branch using the notation feat/{your-name}; note that {your-name} must be concise, for example:

    git checkout -b feat/pre-commit
  5. Push your newly created branch to the remote repository:

    git push --set-upstream origin feat/{your-name}
  6. Commit your changes regularly, using the following prefixes:
    • Updating existing code:

      git commit -m "[UPD] your message here"
    • New modules, classes, functions, etc.:

      git commit -m "[NEW] your message here"

Once you have the feature ready and tested, open a Pull Request to merge your branch into develop.

Bug Fixing Guidelines

Fix or hotfix branches are used to quickly patch production releases. They are very similar to feature branches, except that they use the master branch as their parent instead of develop. When the fix is done, it should be merged into both the master and develop branches, and master should be tagged with an updated version number.

Instructions

  1. Check out the master branch (hotfix branches use master, not develop, as their parent):

    git checkout master
  2. Pull the latest changes from the remote:

    git pull
  3. Double-check that all the dependencies from the current pyproject.toml are installed in your virtual env:

    poetry install
  4. Create your branch using the notation fix/{your-name}; note that {your-name} must be concise, for example:

    git checkout -b fix/logging-bug
  5. Push your newly created branch to the remote repository:

    git push --set-upstream origin fix/{your-name}
  6. Commit your changes regularly, using the following prefixes:
    • Updating existing code:

      git commit -m "[UPD] your message here"
    • New modules, classes, functions, etc.:

      git commit -m "[NEW] your message here"

Once you have the fix ready and tested, open a Pull Request to merge your branch into master, and remember to merge it back into develop as well.