A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
They'd like a data engineer to create a Postgres database with tables designed to optimize queries on song play analysis, and they've brought you onto the project. Your role is to create a database schema and an ETL pipeline for this analysis. You'll be able to test your database and ETL pipeline by running queries given to you by the analytics team from Sparkify and comparing your results with their expected results.
In this project, you'll apply what you've learned about data modeling with Postgres and build an ETL pipeline using Python. To complete the project, you will need to define fact and dimension tables for a star schema for a particular analytic focus, and write an ETL pipeline that transfers data from files in two local directories into these tables in Postgres using Python and SQL.
The data model I've implemented is a star schema, the typical model for a data warehouse. The tables are:
Table songplays
COLUMN | TYPE | CONSTRAINT |
---|---|---|
songplay_id | SERIAL | PRIMARY KEY |
start_time | bigint | NOT NULL |
user_id | int | NOT NULL |
level | varchar | |
song_id | varchar | |
artist_id | varchar | |
session_id | int | |
location | text | |
user_agent | text |
The songplay_id field is the primary key; since it is a SERIAL column, Postgres auto-increments it.
The query to insert data into this table is:
INSERT INTO songplays (start_time, user_id, level, song_id, artist_id, session_id, location, user_agent) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
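For reference, here is a minimal sketch of the corresponding CREATE statement, kept as a Python string the way the project stores its queries; the exact details in the real file may differ.

```python
# Sketch of the songplays CREATE statement; types and constraints follow the table above.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  bigint NOT NULL,
        user_id     int    NOT NULL,
        level       varchar,
        song_id     varchar,
        artist_id   varchar,
        session_id  int,
        location    text,
        user_agent  text
    );
"""
```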
I've created one table for each dimension of the fact table.
Table users
COLUMN | TYPE | CONSTRAINT |
---|---|---|
user_id | int | PRIMARY KEY |
first_name | varchar | |
last_name | varchar | |
gender | varchar(1) | |
level | varchar |
The query to insert data into this table is:
INSERT INTO users (user_id, first_name, last_name, gender, level) VALUES (%s, %s, %s, %s, %s) ON CONFLICT (user_id) DO UPDATE SET first_name = EXCLUDED.first_name, last_name = EXCLUDED.last_name, gender = EXCLUDED.gender, level = EXCLUDED.level
An alternative is to change the ON CONFLICT action to DO NOTHING, which is faster because it skips the UPDATE and is enough if we assume the info about users never changes. DO UPDATE is probably the better choice, though, because it keeps the latest info about each user (for example when a user's level changes), at the cost of a small performance hit on conflicting inserts.
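As an illustration, here is a hedged sketch of how this upsert might be run from the ETL; the sample row is made up and the connection string is the local one shown at the end of this README.

```python
import psycopg2

# Upsert from above; DO UPDATE keeps the latest level for returning users.
user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (user_id) DO UPDATE SET
        first_name = EXCLUDED.first_name,
        last_name  = EXCLUDED.last_name,
        gender     = EXCLUDED.gender,
        level      = EXCLUDED.level;
"""

conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=postgres password=postgres port=5432")
cur = conn.cursor()
cur.execute(user_table_insert, (91, "Jayden", "Bell", "M", "free"))  # illustrative values
conn.commit()
# For the faster DO NOTHING alternative, replace the ON CONFLICT clause with:
# ON CONFLICT (user_id) DO NOTHING
```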
Table songs
COLUMN | TYPE | CONSTRAINT |
---|---|---|
song_id | varchar | PRIMARY KEY |
title | text | |
artist_id | varchar | |
year | int | |
duration | numeric |
The query to insert data into this table is:
INSERT INTO songs (song_id, title, artist_id, year, duration) VALUES (%s, %s, %s, %s, %s) ON CONFLICT (song_id) DO UPDATE SET title = EXCLUDED.title, artist_id = EXCLUDED.artist_id, year = EXCLUDED.year, duration = EXCLUDED.duration
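A hedged sketch of how a song record could be read and inserted, reusing conn/cur from the users example above. It assumes each song file holds a single JSON record and that the field names match the song metadata layout; the path is illustrative.

```python
import pandas as pd

# Read one song metadata file (single JSON record per file is assumed).
df = pd.read_json("data/song_data/TRAAAAW128F429D538.json", lines=True)  # illustrative path

row = df.iloc[0]
# Convert to plain Python types so psycopg2 can adapt them.
song_data = (row["song_id"], row["title"], row["artist_id"], int(row["year"]), float(row["duration"]))
cur.execute(song_table_insert, song_data)  # song_table_insert holds the INSERT shown above
conn.commit()
```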
Table artists
COLUMN | TYPE | CONSTRAINT |
---|---|---|
artist_id | varchar | PRIMARY KEY |
name | varchar | |
location | text | |
latitude | decimal | |
longitude | decimal |
The query to insert data into this table is:
INSERT INTO artists (artist_id, name, location, latitude, longitude) VALUES (%s, %s, %s, %s, %s) ON CONFLICT (artist_id) DO UPDATE SET name = EXCLUDED.name, location = EXCLUDED.location, latitude = EXCLUDED.latitude, longitude = EXCLUDED.longitude
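Continuing the songs sketch above (same df and cur): some song files may lack coordinates, so this hedged sketch converts NaN values to None so they are stored as SQL NULL. The artist_* field names are assumptions about the song JSON layout, and the real etl.py may handle missing values differently.

```python
import math

# Replace float('nan') coordinates with None so psycopg2 inserts NULL.
def nan_to_none(value):
    return None if isinstance(value, float) and math.isnan(value) else value

row = df.iloc[0]
artist_data = [nan_to_none(v) for v in (row["artist_id"], row["artist_name"],
               row["artist_location"], row["artist_latitude"], row["artist_longitude"])]
cur.execute(artist_table_insert, artist_data)  # artist_table_insert holds the INSERT above
conn.commit()
```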
Table time
COLUMN | TYPE | CONSTRAINT |
---|---|---|
start_time | bigint | PRIMARY KEY |
hour | int | |
day | int | |
week | int | |
month | int | |
year | int | |
weekday | varchar |
The query to insert data into this table is:
INSERT INTO time (start_time, hour, day, week, month, year, weekday) VALUES (%s, %s, %s, %s, %s, %s, %s) ON CONFLICT (start_time) DO NOTHING
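The start_time column stores the raw ts value from the log files; the remaining columns are derived from it. Here is a hedged pandas sketch, assuming ts is a Unix epoch in milliseconds, that the log files are newline-delimited JSON, and that only page == 'NextSong' events count as song plays; cur and time_table_insert are as in the sketches above.

```python
import pandas as pd

# Read one log file and keep only song-play events (assumptions noted above).
log_df = pd.read_json("data/log_data/2018-11-01-events.json", lines=True)  # illustrative path
log_df = log_df[log_df["page"] == "NextSong"]

for ts in log_df["ts"]:
    dt = pd.to_datetime(ts, unit="ms")
    cur.execute(time_table_insert,
                (int(ts), dt.hour, dt.day, dt.isocalendar()[1], dt.month, dt.year, dt.day_name()))
```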
The ETL is in the file etl.py and is divided into the following steps:
- Connect to the database.
- Process song files.
- Insert song data into songs table.
- Insert artist data into artists table.
- Process log files.
- Insert ts (Unix timestamp) into the time table.
- From the ts field we derive the hour, day, week, month, year and day of the week (see the pandas sketch under the time table above).
- Insert user info in users table.
- Insert songplay records into the songplays table. In this case we need an additional SELECT to get the song_id and the artist_id; this lookup links the fact table to the song and artist dimensions, so it is essential for the star schema to work. I've improved this query with an additional index on the artist_id field of the songs table to speed up the JOIN with the artists table (the lookup query and index are sketched below, in the queries section).
- Disconnect and finish.
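Here is a hedged sketch of the overall flow of etl.py; the function names and data paths follow the usual layout of this kind of project and are assumptions, not necessarily the exact code.

```python
import glob
import os
import psycopg2

def process_song_file(cur, filepath):
    # Reads one song JSON and fills the songs and artists tables (see the sketches above).
    ...

def process_log_file(cur, filepath):
    # Reads one log JSON, keeps page == 'NextSong' rows, fills time, users and songplays.
    ...

def process_data(cur, conn, filepath, func):
    # Walk the directory tree and apply func to every JSON file found.
    for datafile in glob.glob(os.path.join(filepath, "**", "*.json"), recursive=True):
        func(cur, datafile)
        conn.commit()

def main():
    conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=postgres password=postgres port=5432")
    cur = conn.cursor()
    process_data(cur, conn, filepath="data/song_data", func=process_song_file)
    process_data(cur, conn, filepath="data/log_data", func=process_log_file)
    conn.close()

if __name__ == "__main__":
    main()
```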
A separate queries file contains all the SQL statements used against the database. In this file are:
- All the CREATE statements for the tables.
- All the INSERT statements for the tables.
- The SELECT used to look up song_id and artist_id when filling the songplays table.
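A hedged sketch of that lookup query and of the extra index mentioned in the ETL notes; the variable names and the match on title, artist name and duration are assumptions, not necessarily the project's exact code.

```python
# Look up song_id and artist_id for a log record by matching song title, artist name
# and duration; returns no row when the song is not in the metadata.
song_select = """
    SELECT s.song_id, a.artist_id
    FROM songs s
    JOIN artists a ON a.artist_id = s.artist_id
    WHERE s.title = %s AND a.name = %s AND s.duration = %s;
"""

# Extra index on songs.artist_id to speed up the JOIN with artists.
song_artist_index_create = "CREATE INDEX IF NOT EXISTS idx_songs_artist_id ON songs (artist_id);"
```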
Here are some examples of queries against this data model:
- Get the users with the most song plays:
select U.first_name, U.last_name, count(1) as num_plays from songplays S join users U on U.user_id = S.user_id group by U.first_name, U.last_name order by num_plays desc
- Get the most-played songs and artists:
select SO.title as song, A.name as artist, count(1) as num_plays from songplays S join songs SO on SO.song_id = S.song_id join artists A on A.artist_id = S.artist_id group by SO.title, A.name order by num_plays desc
For local execution you can use the PostgreSQL Docker image:
- Download the Docker image:
docker pull postgres
- Run the container:
docker run --name postgres --rm -p 5432:5432 -e POSTGRES_PASSWORD=postgres -d postgres
- Connection string:
"host=127.0.0.1 dbname=postgres user=postgres password=postgres port=5432"