Project 1: Data modeling with Postgres

This Udacity Data Engineering nanodegree project creates a postgres database sparkifydb for a music app, Sparkify. The purpose of the database is to model song and log datasets (originaly stored in JSON format) with a star schema optimised for queries on song play analysis.

Schema design and ETL pipeline

The star schema has 1 fact table (songplays), and 4 dimension tables (users, songs, artists, time). DROP, CREATE, INSERT, and SELECT queries are defined in sql_queries.py. create_tables.py uses functions create_database, drop_tables, and create_tables to create the database sparkifydb and the required tables.

Extract, transform, load processes in etl.py populate the songs and artists tables with data derived from the JSON song files, data/song_data. Processed data derived from the JSON log files, data/log_data, is used to populate time and users tables. A SELECT query collects song and artist id from the songs and artists tables and combines this with log file derived data to populate the songplays fact table.

The create_tables.py pre-defined connection conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student") will now connect to the container.