This project creates a user activity data model for a music streaming app called Sparkify. The model is optimized to run queries for understanding what songs users are listening to. This model is created in Postgres database by definig facts and dimension tables following STAR schema methodology.
The ETL process imports data in JSON format present under song_data and log_data respectively.
The ETL scripts reads files containing data in JSON format having song and user activity data. After reading the files, the process inserts records in their respective tables as defined below.
songplays
records in log data associated with song plays
users
: users in the appsongs
: songs in music databaseartists
: artists in music databasetime
: timestamps of records in songplays broken down into specific units
There are 4 script files used in this project:
-
create_tables.py
drops and creates tables. -
etl.ipynb
reads and processes a single file from song_data and log_data and loads the data into the tables. This file runs as Jupyter notebook, that utilizes Pandas and psycopg2 postgres client library to read and insert data in tables. -
etl.py
reads and processes files from song_data and log_data and loads them into the tables. -
sql_queries.py
contains all sql queries to create and drop tables. This script is imported into above three files for creating and inserting records in the tables.
Python scripts can be run from terminal by executing python file_name.extension in terminal. For ex: to run create_tables.py type python create_tables.py
in terminal and enter.