Project: Sparkify Data Modeling with Postgres
Introduction
A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
Solution
To solve this problem I created an ETL pipeline for data analysis. Sparkify provided the data sources; the ETL pipeline reads the raw JSON data, transforms it into the required shape, and saves the modeled data into Postgres tables.
Data Modeling
The data were modeled into a star schema with the following fact and dimension tables:
Fact Table
songplay
songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
Dimension Tables
users
user_id, first_name, last_name, gender, level
songs
song_id, title, artist_id, year, duration
artists
artist_id, name, location, latitude, longitude
time
start_time, hour, day, week, month, year, weekday
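The schema above can be created with DDL along the following lines. This is a sketch only: the column types and constraints are assumptions, and the authoritative definitions live in sql_queries.py.

```python
# Sketch of the star-schema DDL, in the style of sql_queries.py.
# Column types and constraints here are illustrative assumptions.

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplay (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

time_table_create = """
CREATE TABLE IF NOT EXISTS time (
    start_time TIMESTAMP PRIMARY KEY,
    hour INT, day INT, week INT,
    month INT, year INT, weekday INT
);
"""
```

A surrogate key (SERIAL) on the fact table keeps inserts simple, while the time table is naturally keyed by its timestamp.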
Project Structure
├── README.md
├── create_tables.py
├── data
│ ├── log_data
│ │ └── 2018
│ │ └── 11
│ └── song_data
│ └── A
│ ├── A
│ │ ├── A
│ │ ├── B
│ │ └── C
│ └── B
│ ├── A
│ ├── B
│ └── C
├── etl.ipynb
├── etl.py
├── sql_queries.py
└── test.ipynb
14 directories, 6 files
create_tables.py
should be executed first to create the database and the star-schema tables. It uses the SQL queries defined in the sql_queries.py file.
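Its flow is roughly the following. This is a hypothetical sketch: the connection string, credentials, and query-list names are assumptions based on the usual project template, and the query lists (inlined here) would normally be imported from sql_queries.py.

```python
# Hypothetical sketch of create_tables.py's flow. The query lists are
# normally imported from sql_queries.py; minimal stand-ins are inlined here.
drop_table_queries = ["DROP TABLE IF EXISTS songplay;"]
create_table_queries = [
    "CREATE TABLE IF NOT EXISTS songplay (songplay_id SERIAL PRIMARY KEY);"
]

def run_queries(cur, queries):
    """Execute each SQL statement, in order, on the given cursor."""
    for query in queries:
        cur.execute(query)

if __name__ == "__main__":
    import psycopg2  # requires a running local Postgres (see Constraints)
    conn = psycopg2.connect(
        "host=127.0.0.1 dbname=sparkifydb user=student password=student"
    )
    conn.set_session(autocommit=True)
    cur = conn.cursor()
    run_queries(cur, drop_table_queries)    # drop first so the script is rerunnable
    run_queries(cur, create_table_queries)  # then recreate the star schema
    conn.close()
```

Dropping before creating is what makes the script safe to rerun from scratch.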
etl.ipynb
is a Jupyter notebook that walks through the ETL process step by step. All code in this notebook also appears in etl.py, in a more organised form.
etl.py
is the ETL pipeline itself. It reads the JSON files under data/
and loads them into the Postgres database tables.
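One representative transformation: the log files carry timestamps as milliseconds since the epoch (the `ts` field), which must be expanded into the columns of the time table. A minimal stdlib-only sketch follows; whether etl.py interprets timestamps as UTC is an assumption here, and the function name is illustrative.

```python
from datetime import datetime, timezone

def ts_to_time_row(ts_ms):
    """Expand a millisecond epoch timestamp (the `ts` field of a log
    record) into the columns of the time dimension table:
    (start_time, hour, day, week, month, year, weekday)."""
    t = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return (t, t.hour, t.day, t.isocalendar()[1], t.month, t.year, t.weekday())
```

Each such row is then inserted into the time table, keyed by start_time.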
sql_queries.py
contains the SQL queries used to create, drop, insert into and select from the database tables.
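The insert statements are where duplicates get handled. For example, a user's level (free/paid) can change between events, so an upsert along these lines is a common choice; this is a sketch, not necessarily the exact query in sql_queries.py.

```python
# Hypothetical upsert for the users table: if the user already exists,
# refresh their level instead of failing on the duplicate key.
user_table_insert = """
INSERT INTO users (user_id, first_name, last_name, gender, level)
VALUES (%s, %s, %s, %s, %s)
ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level;
"""
```

The `%s` placeholders are psycopg2-style parameters, filled in by cur.execute(user_table_insert, row).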
test.ipynb
should be run to verify that the data has been loaded correctly.
Run the Project
Once you clone this repository you should proceed as follows:
Run create_tables.py file:
python create_tables.py
Run etl.py file:
python etl.py
Finally, start Jupyter (jupyter notebook) and run test.ipynb to confirm that the tables were created and the data was saved in your database.
Constraints
You need a Postgres database running locally. Docker is a convenient way to run one.
Greetings
Made with Love in Malta
Wagner de Lima