This project contains the schema and ETL pipeline code for a data warehouse for the music streaming app Sparkify.
The data warehouse is built on AWS: the source data lives in Amazon S3 and is loaded into staging tables on Amazon Redshift, whose PostgreSQL-based engine hosts the warehouse itself. The analytics tables are arranged in a star schema, optimizing the warehouse for OLAP workloads so the Sparkify team can readily run queries about user activity on the app, such as which songs users are listening to. Both the DDL and the ETL code are executed from Python scripts.
The project contains the following components:
├── README.md # This file.
├── create_tables.py # Python script with all methods necessary to recreate the data warehouse (structure sketched after this tree).
├── etl.py # Python script with all methods necessary to trigger the execution of the ETL process.
├── dwh.cfg # Config file to be populated with credentials for accessing all cloud resources.
└── sql_queries.py # Python script defining the data warehouse schema and prepared/reusable queries.
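Both Python scripts follow the same pattern: read dwh.cfg, open a connection to the Redshift cluster, and execute the statements defined in sql_queries.py. As a rough illustration, a minimal sketch of create_tables.py might look like the following; the list names drop_table_queries and create_table_queries are assumptions and may not match the actual sql_queries.py exactly.

```python
# Illustrative sketch only -- the real create_tables.py may differ in detail.
import configparser

import psycopg2

# Assumed names: the actual sql_queries.py may expose different lists.
from sql_queries import create_table_queries, drop_table_queries


def drop_tables(cur, conn):
    """Drop all warehouse tables so the schema can be rebuilt from scratch."""
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()


def create_tables(cur, conn):
    """Create the staging, fact, and dimension tables."""
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    # Read Redshift connection details from the [CLUSTER] section of dwh.cfg.
    config = configparser.ConfigParser()
    config.read('dwh.cfg')
    cluster = config['CLUSTER']

    conn = psycopg2.connect(
        host=cluster['HOST'],
        dbname=cluster['DB_NAME'],
        user=cluster['DB_USER'],
        password=cluster['DB_PASSWORD'],
        port=cluster['DB_PORT'],
    )
    cur = conn.cursor()

    drop_tables(cur, conn)
    create_tables(cur, conn)

    conn.close()


if __name__ == "__main__":
    main()
```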
The database contains the following fact table:

- songplays - user song plays

songplays has foreign keys to the following (self-explanatory) dimension tables (an example join across them follows this list):

- users
- songs
- artists
- time
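To illustrate the kind of OLAP query the star schema supports, something like the following (written in the style of sql_queries.py; the column names are assumptions rather than the project's exact schema) joins the fact table to two of its dimensions to rank the most-played songs; users and time can be joined the same way.

```python
# Illustrative analytic query -- column names are assumptions, not the exact schema.
TOP_SONGS = """
    SELECT s.title,
           a.name   AS artist,
           COUNT(*) AS play_count
    FROM songplays sp
    JOIN songs   s ON sp.song_id   = s.song_id
    JOIN artists a ON sp.artist_id = a.artist_id
    GROUP BY s.title, a.name
    ORDER BY play_count DESC
    LIMIT 10;
"""
```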
You will need to update dwh.cfg with the following (a sketch of how the scripts read these values follows the listing):
[CLUSTER]
HOST=<your_host>
DB_NAME=<your_db_name>
DB_USER=<your_db_user>
DB_PASSWORD=<your_db_password>
DB_PORT=<your_db_port>
[IAM_ROLE]
ARN=<your_iam_role_arn>
[S3]
LOG_DATA='s3://udacity-dend/log_data'
LOG_JSONPATH='s3://udacity-dend/log_json_path.json'
SONG_DATA='s3://udacity-dend/song_data'
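For reference, the scripts can load these values with Python's standard configparser module; a minimal sketch, assuming dwh.cfg sits in the working directory:

```python
import configparser

# Read dwh.cfg from the current working directory (assumed to be the project root).
config = configparser.ConfigParser()
config.read('dwh.cfg')

host = config['CLUSTER']['HOST']      # Redshift cluster endpoint
role_arn = config['IAM_ROLE']['ARN']  # IAM role Redshift assumes when reading from S3
log_data = config['S3']['LOG_DATA']   # note: configparser keeps the quotes written in dwh.cfg
```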
To execute the ETL on an existing cluster from the command line, navigate to the root of the project and run the following commands (a sketch of the staging load that etl.py performs follows):
python3 create_tables.py
python3 etl.py
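Under the hood, the first step of etl.py is to stage the raw S3 data into Redshift, which is typically done with Redshift COPY statements. A hedged sketch of that staging step is below; the staging table name, AWS region, and exact COPY options are assumptions, not taken from the project code.

```python
# Illustrative sketch of the staging load -- staging_events and the region are assumptions.
import configparser

import psycopg2

config = configparser.ConfigParser()
config.read('dwh.cfg')

# The quotes stored around the S3 paths in dwh.cfg double as the SQL string quotes here.
STAGING_EVENTS_COPY = """
    COPY staging_events
    FROM {log_data}
    IAM_ROLE '{role_arn}'
    FORMAT AS JSON {log_jsonpath}
    REGION 'us-west-2';
""".format(
    log_data=config['S3']['LOG_DATA'],
    role_arn=config['IAM_ROLE']['ARN'],
    log_jsonpath=config['S3']['LOG_JSONPATH'],
)

conn = psycopg2.connect(
    host=config['CLUSTER']['HOST'],
    dbname=config['CLUSTER']['DB_NAME'],
    user=config['CLUSTER']['DB_USER'],
    password=config['CLUSTER']['DB_PASSWORD'],
    port=config['CLUSTER']['DB_PORT'],
)
# The connection context manager commits the COPY on success.
with conn, conn.cursor() as cur:
    cur.execute(STAGING_EVENTS_COPY)
conn.close()
```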