A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
They'd like to create a Postgres database with tables designed to optimize queries on song play analysis, and here comes the repository.
There are two main goals for this repo.
- Define fact and dimension tables for a star schema.
- Establish an ETL pipeline to realize the schema above and serve the analytics team's analysis purpose. The ETL pipeline will transfer data from files in two local directories into these tables in Postgres using Python and SQL.
There are two local directories containing two diferent datasets as follows:
- Song Dataset
This dataset is a subset of real data from the Million Song Dataset. Each file is in JSON
format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
And below is an example of what a single song file, TRAABJL12903CDCF1A.json
, looks like.
{
"num_songs": 1,
"artist_id": "ARJIE2Y1187B994AB7",
"artist_latitude": null,
"artist_longitude": null,
"artist_location": "",
"artist_name": "Line Renaud",
"song_id": "SOUPIRU12A6D4FA1E1",
"title": "Der Kleine Dompfaff",
"duration": 152.92036,
"year": 0
}
- Log Dataset
This dataset consists of log files in
JSON
format generated by this event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.
The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset.
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
You can use pandas.read_json(filepath, lines=True)
method to load in these files to scrutinize their contents.
The repo designs a star schema optimized for the queries on the song play analysis. This includes the FIVE tables (ONE for the fact table and the other FOUR for dimension tables). These tables ingest corresponding information from those JSON
files depicted above. The following describes the columns of each table.
songplays
This table are records in log data associated with song plays i.e. records with page NextSong
- songplay_id
- start_time
- user_id
- level
- song_id
- artist_id
- session_id
- location
- user_agent
-
users
users
table records users information in the app. It has columns as follows,- user_id
- first_name
- last_name
- gender
- level
-
songs
songs
table records basic information of songs in music database. It has columns as follows,- song_id
- title
- artist_id
- year
- duration
-
artists
artists
table records information about the artist in music database. It has columns as follows,- artist_id
- name
- location
- latitude
- longitude
-
time
time
table records the timestamps of records in songplays broken down into specific units. It has columns as follows,- start_time
- hour
- day
- week
- month
- year
- weekday
In addition to this README.md
and the data files, this repository also includes another five files:
-
test.ipynb
displays the first few rows of each table to let you check the database. -
create_tables.py
drops and creates the tables. Run this file can reset the tables each time before runing the ETL scripts (see below). -
etl.ipynb
reads and processes a single file from song_data and log_data and loads the data into those tables in the database respectively. This notebook also contains detailed steps on building the ETL process for each of the tables. You can regard this notebook as a preparation step or experiment step only on a single file before reading and processing all the file in batch in theetl.py
file. -
etl.py
reads and processes the whole files from song_data and log_data and loads them into the five tables. The coding logics is based on theetl.ipynb
notebook. -
sql_queries.py
contains all the sql queries (including drop and create tables), and is imported into the last three files above.
Below are steps you can follow to walk through this repository:
- Run
create_tables.py
to create the database and tables. - Run
test.ipynb
to confirm the creation of the database and tables with the correpsponding columns based on the schema. Click "Restart kernel" to close the connection before going on with the ETL process.
Instructions in the etl.ipynb
notebook can guide you through how the ETL process is developed for each table. At the end of each table section, or at the end of the notebook, you can run test.ipynb
to confirm that records were successfully inserted into each table. Be sure to rerun create_tables.py to reset your tables before each time you run this notebook.
- Run the
etl.py
file is where you read and process the entire dataset -- namely the PIPELINE. The principles for the processes behind are based on theetl.ipynb
notebook. (Be sure to runcreate_tables.py
before runningetl.py
to reset (or initialize) the tables.) - Run
test.ipynb
to confirm the records were successfully inserted into each table.