This project supports the Analytics team at Sparkify, a recent startup providing a web and mobile music streaming app. The team has been collecting data about their users (only those who gave explicit consent under GDPR). The data resides in a directory of JSON logs of user activity on the app, together with JSON metadata about the songs the app serves. The aim of the project is to easily extract the data from the JSON files, transform it, and load it (ETL) into a database suitable for obtaining useful insights, e.g. which songs Sparkify's users listen to.
The database is not normalized because it must be ready for analytical processing. There is no need to modify records quickly; instead, queries must run as fast as possible with few aggregations (or none at all). The following star schema describes sparkifydb.
As can be seen in the figure, there is one fact table, songplays, and four dimension tables: artists, songs, time, and users. The star schema is self-explanatory: at its center ('the star') sits the fact table, a derived table that brings together song, artist, time, and user information. If further detail is needed, foreign keys (FK) allow retrieving, e.g., a user's first_name. The key of songplays, songplay_id, is a surrogate key created for simplicity of indexing, so it does not arise naturally from the context; it is autogenerated by the DBMS.
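For reference, here is a minimal sketch of how the songplays fact table could be declared in sql_queries.py; songplay_id and the foreign keys follow the schema above, while the remaining column names and types are assumptions, so refer to the script for the exact definition.

# Sketch of the fact-table DDL, in the style of sql_queries.py.
# Exact column names, types, and constraints may differ in the actual script.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,  -- surrogate key, autogenerated by the DBMS
    start_time  TIMESTAMP NOT NULL REFERENCES time (start_time),
    user_id     INT       NOT NULL REFERENCES users (user_id),
    level       VARCHAR,
    song_id     VARCHAR REFERENCES songs (song_id),
    artist_id   VARCHAR REFERENCES artists (artist_id),
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""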
There are performance and extensibility tradeoffs that I was not able to resolve without further discussion with the team. Since the team will not be available for a while, I had to make some assumptions: I chose extensibility and better code comprehension over performance.
Pandas iterrows is one of the slowest parts of the pipeline. Using to_records would be much faster, but it comes at a cost: losing comprehension and flexibility. Accessing records by positional index instead of by named column may cause unintended effects in the future, e.g. if the column order changes.
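To make the tradeoff concrete, here is a minimal illustration (the DataFrame and its columns are made up for the example):

import pandas as pd

df = pd.DataFrame({"user_id": [52, 97], "level": ["free", "paid"]})

# Current approach: iterrows yields (index, Series) pairs, so columns are
# accessed by name -- slower, but robust to changes in column order.
for _, row in df.iterrows():
    print(row["user_id"], row["level"])

# Faster alternative: to_records exposes records accessed by position,
# which silently breaks if the column order ever changes.
for rec in df.to_records(index=False):
    print(rec[0], rec[1])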
Logs could be ingested directly with COPY bulk inserts (see COPY and copy_from). However, this would move the ETL pipeline (at least partially) from pure Python to SQL. COPY, in fact, does not allow modifying data while copying from a file, e.g. timestamp conversions or SELECTs. Therefore, SQL constraints (FK, PK, etc.) would have to be added with ALTER TABLE queries after populating the tables AND cleaning them, e.g. removing duplicates. The only SELECT (song_select in sql_queries.py) could be improved by creating a JOIN table between artists and songs and then creating an index on both artist_id and song_id.
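For illustration only, here is a sketch of what such a bulk load could look like with psycopg2's copy_from; the staging table, input file, and column list are hypothetical and not part of the current pipeline:

import psycopg2

# Hypothetical: bulk-load a pre-cleaned, tab-separated dump of log events
# into a staging table. COPY cannot transform data or enforce constraints
# while loading, so constraints are added afterwards.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

with open("log_events.tsv") as f:  # hypothetical pre-processed file
    cur.copy_from(f, "staging_events", sep="\t",
                  columns=("user_id", "song", "artist", "ts", "level"))

# Constraints are added only after the load and the cleaning step.
cur.execute("ALTER TABLE staging_events ADD PRIMARY KEY (user_id, ts);")
conn.commit()
conn.close()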
The project is made of two Jupyter notebooks and three Python scripts, along with a data directory and a Makefile to make it easier to run. The following tree shows an excerpt of the project directories.
|-- data
|   |-- log_data
|   |   `-- 2018
|   |       `-- 11
|   |           |-- 2018-11-01-events.json
|   |           |-- 2018-11-02-events.json
|   |           |-- 2018-11-03-events.json
|   |           ...
|   `-- song_data
|       `-- A
|           |-- A
|           |   |-- A
|           |   |   |-- TRAAAAW128F429D538.json
|           |   |   |-- TRAAABD128F429CF47.json
|           |   |   |-- TRAAADZ128F9348C2E.json
|           |   |   ...
|           |   |-- B
|           |   |   |-- TRAABCL128F4286650.json
|           |   |   |-- TRAABDL12903CAABBA.json
|           |   |   |-- TRAABJL12903CDCF1A.json
|           |   |   ...
|           |   `-- C
|           |       |-- TRAACCG128F92E8A55.json
|           |       |-- TRAACER128F4290F96.json
|           |       |-- TRAACFV128F935E50B.json
|           |       ...
|           `-- B
|               |-- A
|               |   |-- TRABACN128F425B784.json
|               |   |-- TRABAFJ128F42AF24E.json
|               |   |-- TRABAFP128F931E9A1.json
|               |   ...
|               |-- B
|               |   |-- TRABBAM128F429D223.json
|               |   |-- TRABBBV128F42967D7.json
|               |   |-- TRABBJE12903CDB442.json
|               |   ...
|               `-- C
|                   |-- TRABCAJ12903CDFCC2.json
|                   |-- TRABCEC128F426456E.json
|                   |-- TRABCEI128F424C983.json
|                   ...
|-- diagrams
|   `-- sparkifydb-starschema.png
|-- Makefile
|-- README.md
|-- create_tables.py
|-- etl.ipynb
|-- etl.py
|-- requirements.txt
|-- sql_queries.py
`-- test.ipynb
The data dir contains the JSON files collected by the team, split into song_data and log_data. Each song can be reached through directories named after letters of its track ID. A song JSON file, e.g. data/song_data/A/A/A/TRAAAEF128F4273421.json, looks like the following listing:
{
    "num_songs": 1,
    "artist_id": "AR7G5I41187FB4CE6C",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "London, England",
    "artist_name": "Adam Ant",
    "song_id": "SONHOTT12A8C13493C",
    "title": "Something Girls",
    "duration": 233.40363,
    "year": 1982
}
It is possible to find null values of latitude and longitude, as shown above.
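For context, here is a minimal sketch of how such a file could be read and split into songs and artists dimension records with pandas (an assumption about how etl.py loads it; refer to the script for details):

import pandas as pd

# Read a single song file; lines=True treats the file as one JSON record
# per line (assumption about how etl.py loads it).
df = pd.read_json("data/song_data/A/A/A/TRAAAEF128F4273421.json", lines=True)

# Split the record into the songs and artists dimension columns.
song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
artist_data = df[["artist_id", "artist_name", "artist_location",
                  "artist_latitude", "artist_longitude"]].values[0].tolist()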
A log file, e.g. this excerpt from data/log_data/2018/11/2018-11-04-events.json, instead contains the following information:
{
    "artist": null,
    "auth": "Logged In",
    "firstName": "Theodore",
    "gender": "M",
    "itemInSession": 0,
    "lastName": "Smith",
    "length": null,
    "level": "free",
    "location": "Houston-The Woodlands-Sugar Land, TX",
    "method": "GET",
    "page": "Home",
    "registration": 1540306145796.0,
    "sessionId": 154,
    "song": null,
    "status": 200,
    "ts": 1541290555796,
    "userAgent": "Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko\/20100101 Firefox\/31.0",
    "userId": "52"
}
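The ts field is a Unix timestamp in milliseconds, which is why the pipeline needs the timestamp conversion mentioned earlier; a minimal sketch with pandas:

import pandas as pd

# ts is Unix epoch time in milliseconds; convert it to a datetime so it can
# populate the time dimension table.
ts = pd.to_datetime(1541290555796, unit="ms")
print(ts)  # 2018-11-04 00:15:55.796000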
- sql_queries.py contains all the queries to create, drop, and populate the tables;
- create_tables.py creates the database with clean (empty) tables;
- etl.py implements the ETL pipeline that extracts the data from the JSON files and populates sparkifydb (see the sketch after this list);
- etl.ipynb is the exploratory notebook that led to etl.py;
- test.ipynb is the notebook used to test the queries and manually check the content of the database.
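To give an idea of how etl.py is organized, here is a minimal sketch of its overall structure; the function names and the connection string are assumptions, so refer to the script for the real implementation:

import glob
import os

import psycopg2


def process_song_file(cur, filepath):
    """Placeholder: parse one song file and insert song/artist records."""


def process_log_file(cur, filepath):
    """Placeholder: parse one log file and insert time/user/songplay records."""


def process_data(cur, conn, filepath, func):
    """Walk filepath, collect every JSON file, and apply func to each one."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))

    for datafile in all_files:
        func(cur, datafile)
        conn.commit()


def main():
    # Connection string is an assumption based on the setup described below.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()

    process_data(cur, conn, filepath="data/song_data", func=process_song_file)
    process_data(cur, conn, filepath="data/log_data", func=process_log_file)

    conn.close()


if __name__ == "__main__":
    main()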
The project assumes that Postgres is installed and that a user "student" with password "student" exists, as well as a default database "student".
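With those defaults, the first connection made by create_tables.py (before sparkifydb exists) presumably looks like the following sketch; the host and encoding are assumptions:

import psycopg2

# Connect to the default database first, then (re)create sparkifydb from there.
conn = psycopg2.connect("host=127.0.0.1 dbname=student user=student password=student")
conn.set_session(autocommit=True)  # CREATE DATABASE cannot run inside a transaction
cur = conn.cursor()
cur.execute("DROP DATABASE IF EXISTS sparkifydb;")
cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8';")
conn.close()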
The easiest way to run the project is using the Makefile.
To create a Python virtualenv do:
make setup
Then, install all the requirements with:
make install
Finally, run the ETL pipeline with:
make etl
In case you want to test the notebooks for integrity, you can run:
make test
Python code has been linted with pylint and formatted with black using:
make lint && make format
To rebuild a clean database run:
make create
The Makefile can be run on any Unix-like system thanks to GNU Make. If you do not have Make on Windows, you can install it with Chocolatey. If you do not wish to install Make, you can run the following commands instead.
To set up a virtualenv:
python3 -m venv .env/sparkifydb
To install the required libraries:
python3 -m pip install --upgrade pip && python3 -m pip install -r requirements.txt
Create a clean database:
python3 create_tables.py
Run the ETL pipeline:
python3 etl.py