In this project, we create a database schema in Postgres and build an ETL pipeline using Python.
A startup collects song and user-activity data from its music streaming app. The analytics team wants to know which songs users are listening to. The data resides in a directory of JSON logs of user activity and a directory of JSON metadata on the songs in the app.
The song dataset (in the data folder) is a subset of real data from the Million Song Dataset.
- Set environment variables: PGHOSTADDR to the IP address of the Postgres host, and PGPASSWORD to the database password (see the sketch below).
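Assuming the scripts connect with psycopg2 (the usual driver for a Python Postgres ETL), libpq picks both variables up automatically, so neither value needs to appear in the code. A minimal sketch of a connection that relies on them:

```python
import psycopg2

# libpq (used by psycopg2) reads PGHOSTADDR and PGPASSWORD from the
# environment, so neither value needs to appear in the connection call.
# The user name is a placeholder -- use the role that owns sparkifydb.
conn = psycopg2.connect(dbname="sparkifydb", user="postgres")
conn.close()
```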
- Clone the repo:
git clone https://github.com/mhaywardhill/Song-Play-Analysis.git
- Set up the Python environment:
conda create -n sparkify python=3.6
conda activate sparkify
pip install -r requirements.txt
- Create the sparkifydb database and its schema:
python ./create_tables.py
- Run the ETL:
python ./etl.py
The repo contains the following files:

sql_queries.py: Contains the SQL statements that drop and create the fact and dimension tables, as well as the queries that load the dimension tables.
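For a feel of what the file holds, here is a hypothetical drop/create/insert trio in the same style; the real table names and columns are defined in the repo:

```python
# Hypothetical statements in the style sql_queries.py uses; the actual
# table definitions live in the repo.
song_table_drop = "DROP TABLE IF EXISTS songs"

song_table_create = """
CREATE TABLE IF NOT EXISTS songs (
    song_id   VARCHAR PRIMARY KEY,
    title     VARCHAR NOT NULL,
    artist_id VARCHAR NOT NULL,
    year      INT,
    duration  FLOAT
)
"""

song_table_insert = """
INSERT INTO songs (song_id, title, artist_id, year, duration)
VALUES (%s, %s, %s, %s, %s)
ON CONFLICT (song_id) DO NOTHING
"""
```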
create_tables.py: Creates the sparkifydb database and the database schema (the tables).
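The usual shape of such a script, sketched here assuming sql_queries.py exposes the statements above (the imported names are illustrative):

```python
import psycopg2

from sql_queries import song_table_create, song_table_drop  # illustrative names


def main():
    # Connect to a maintenance database first so sparkifydb can be (re)created;
    # CREATE DATABASE cannot run inside a transaction, hence autocommit.
    conn = psycopg2.connect(dbname="postgres", user="postgres")  # placeholder role
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute("DROP DATABASE IF EXISTS sparkifydb")
    cur.execute("CREATE DATABASE sparkifydb")
    conn.close()

    # Reconnect to sparkifydb and build the schema.
    conn = psycopg2.connect(dbname="sparkifydb", user="postgres")
    cur = conn.cursor()
    cur.execute(song_table_drop)
    cur.execute(song_table_create)
    conn.commit()
    conn.close()


if __name__ == "__main__":
    main()
```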
etl.py: Runs the full ETL pipeline.
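A condensed sketch of the extract-and-load loop such a script runs; the data/song_data layout and the column names are assumptions based on the song metadata described above:

```python
import glob
import os

import pandas as pd
import psycopg2

from sql_queries import song_table_insert  # illustrative name


def process_song_files(cur, conn, data_dir=os.path.join("data", "song_data")):
    # Walk the directory tree (assumed layout) and load every JSON song file.
    for path in glob.glob(os.path.join(data_dir, "**", "*.json"), recursive=True):
        df = pd.read_json(path, lines=True)  # one JSON object per line
        for _, row in df.iterrows():
            cur.execute(
                song_table_insert,
                (row.song_id, row.title, row.artist_id,
                 int(row.year), float(row.duration)),
            )
        conn.commit()


if __name__ == "__main__":
    conn = psycopg2.connect(dbname="sparkifydb", user="postgres")  # placeholder role
    process_song_files(conn.cursor(), conn)
    conn.close()
```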
Manual_ETL.ipynb: Jupyter notebook for running the ETL manually, step by step.