This is project is intended to help Sparkify music startup to build a database in postgres to query their data. Until now their dataresides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
The code in this repo are ETLs to load songplays, users, songs, artists and time tables into a database in Postgres.
- Load the song and log data into each of the folders in data.
- Run
creates_tables.py
to drop all existing tables in the sparkify database, and create new ones. - Run
etl.py
to extract, transform and load into sparkify database the songplays, users, songs, artists and time tables from song and log json files. - Run
test.ipynb
to verify that etl ran successfully.
data
: Contains the song data and log data, which are json files.creates_tables.py
: Drops and creates tables in Postgre sparkify databse.etl.ipynb
: Notebook with all the prototype code to build the ETL process.etl.py
: Runs the ETLs to load data into sparkify databse. It creates the following tables:- songplays
- users
- songs
- artists
- time
sql_queries.py
: Script with all the drop, create and insert SQL commands.test.ipynb
: Script that tests the existence and content of each of the tables in the database.