Udacity Data Engineering Nanodegree | Project 5
Data Pipeline Project
A data pipeline that is dynamic and reusable, easily monitored, supports backfills, and verifies data quality after the ETL steps. The source data (JSON logs describing user activity in a music streaming app and JSON metadata about the songs users listen to) resides in an S3 bucket, and the destination is a data warehouse on Amazon Redshift.
Project Files
sparkify_dag.py
- The main DAG that defines and orders all of the pipeline tasks (a sketch of the task wiring follows this file list)
stage_redshift.py
- Custom operator that loads JSON files from S3 into Redshift staging tables (see the operator sketch after this list)
load_fact.py
- Runs the query that loads the 'songplays' fact table
load_dimension.py
- Runs the queries that load the 'users', 'songs', 'artists', and 'time' dimension tables
data_quality.py
- Checks data quality by counting the records in each table; raises an error if a table contains zero records (a sketch of this check appears at the end of this README)
create_tables.sql
- SQL script that creates tables for the Redshift data warehouse
sql_queries.py
- SQL queries for the custom operators that load the fact and dimension tables
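The wiring of these tasks is easiest to see in the DAG file itself. Below is a minimal sketch of what sparkify_dag.py might look like. The operator class names match the project files, but the import paths, connection IDs, bucket and key names, schedule, and parameter names (redshift_conn_id, aws_credentials_id, table, s3_bucket, s3_key, truncate, tables) are illustrative assumptions, not the project's exact code.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Import paths are assumptions -- adjust to wherever the custom operators
# live in your Airflow plugins/ or dags/ folder.
from operators.stage_redshift import StageToRedshiftOperator
from operators.load_fact import LoadFactOperator
from operators.load_dimension import LoadDimensionOperator
from operators.data_quality import DataQualityOperator

default_args = {
    'owner': 'sparkify',
    'start_date': datetime(2019, 1, 12),
    'depends_on_past': False,             # each run is independent of earlier runs
    'retries': 3,                         # retry failed tasks
    'retry_delay': timedelta(minutes=5),
    'email_on_retry': False,
}

dag = DAG(
    'sparkify_dag',
    default_args=default_args,
    description='Stage, load, and validate Sparkify data in Redshift',
    schedule_interval='@hourly',          # the DAG runs on this schedule once turned on
    catchup=False,
)

start = DummyOperator(task_id='Begin_execution', dag=dag)

stage_events = StageToRedshiftOperator(
    task_id='Stage_events', dag=dag,
    redshift_conn_id='redshift',          # Airflow connection IDs (assumed names)
    aws_credentials_id='aws_credentials',
    table='staging_events',
    s3_bucket='udacity-dend', s3_key='log_data',
)

stage_songs = StageToRedshiftOperator(
    task_id='Stage_songs', dag=dag,
    redshift_conn_id='redshift',
    aws_credentials_id='aws_credentials',
    table='staging_songs',
    s3_bucket='udacity-dend', s3_key='song_data',
)

load_songplays = LoadFactOperator(
    task_id='Load_songplays_fact_table', dag=dag,
    redshift_conn_id='redshift', table='songplays',
)

load_users = LoadDimensionOperator(
    task_id='Load_users_dim_table', dag=dag,
    redshift_conn_id='redshift', table='users',
    truncate=True,                        # truncate-insert pattern for dimensions
)

quality_checks = DataQualityOperator(
    task_id='Run_data_quality_checks', dag=dag,
    redshift_conn_id='redshift',
    tables=['songplays', 'users', 'songs', 'artists', 'time'],
)

end = DummyOperator(task_id='Stop_execution', dag=dag)

# Stage both sources in parallel, load the fact table, then the dimensions,
# and finish with the data quality checks.
start >> [stage_events, stage_songs] >> load_songplays
load_songplays >> load_users >> quality_checks >> end
```

The remaining dimension tables ('songs', 'artists', 'time') would be loaded by additional LoadDimensionOperator tasks following the same pattern.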
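For the staging step, stage_redshift.py builds a Redshift COPY statement from the S3 path and the AWS credentials stored in Airflow. A sketch of such an operator, assuming Airflow 1.10-style imports and the same assumed parameter names as above:

```python
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class StageToRedshiftOperator(BaseOperator):
    """Copies JSON files from S3 into a Redshift staging table."""

    copy_sql = """
        COPY {table}
        FROM '{s3_path}'
        ACCESS_KEY_ID '{access_key}'
        SECRET_ACCESS_KEY '{secret_key}'
        REGION 'us-west-2'
        FORMAT AS JSON '{json_format}'
    """

    @apply_defaults
    def __init__(self, redshift_conn_id='', aws_credentials_id='',
                 table='', s3_bucket='', s3_key='', json_format='auto',
                 *args, **kwargs):
        super(StageToRedshiftOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.aws_credentials_id = aws_credentials_id
        self.table = table
        self.s3_bucket = s3_bucket
        self.s3_key = s3_key
        self.json_format = json_format

    def execute(self, context):
        # Pull AWS credentials and the Redshift connection from Airflow
        credentials = AwsHook(self.aws_credentials_id).get_credentials()
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)

        s3_path = 's3://{}/{}'.format(self.s3_bucket, self.s3_key)
        self.log.info('Copying %s into %s', s3_path, self.table)
        redshift.run(StageToRedshiftOperator.copy_sql.format(
            table=self.table,
            s3_path=s3_path,
            access_key=credentials.access_key,
            secret_key=credentials.secret_key,
            json_format=self.json_format,
        ))
```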
How To Run Pipeline
Configure the AWS CLI by running aws configure
Enter your access key ID and secret access key when prompted
Set the region to us-west-2
If the AWS CLI isn't installed, follow the official installation documentation
Create a Redshift cluster on AWS
Place the cluster in the same region as the S3 bucket (in this case: us-west-2)
Run create_tables.sql to create the tables in the Redshift cluster (a psycopg2 sketch appears after these steps)
Run the Airflow start shell script, then open the Airflow UI
In the Airflow UI, configure the connection to Redshift using the cluster's endpoint (the DAG sketch above assumes the connection IDs 'redshift' and 'aws_credentials')
Turn on the DAG
The DAG should then run automatically, since a schedule interval is specified
Monitor the pipeline's progress using the Airflow UI
Verify that the transformed data passes the data quality checks
Be sure to turn off the DAG and delete the Redshift cluster when done
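To run create_tables.sql against the cluster, you can paste it into the Redshift query editor or execute it programmatically. A minimal sketch using psycopg2; the host, database, user, and password values are placeholders for your own cluster details:

```python
import psycopg2

# Connect to the Redshift cluster endpoint (placeholder values below)
conn = psycopg2.connect(
    host='<redshift-cluster-endpoint>',
    dbname='dev',
    user='awsuser',
    password='<password>',
    port=5439,
)

# Execute every CREATE TABLE statement in the script, then commit and close
with conn, conn.cursor() as cur, open('create_tables.sql') as f:
    cur.execute(f.read())
conn.close()
```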
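The data quality verification at the end of a run relies on the record-count check implemented in data_quality.py. A minimal sketch of such an operator, assuming Airflow 1.10-style imports and illustrative parameter names:

```python
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class DataQualityOperator(BaseOperator):
    """Fails the task if any of the given tables holds zero records."""

    @apply_defaults
    def __init__(self, redshift_conn_id='', tables=None, *args, **kwargs):
        super(DataQualityOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            # Count the rows in each table and fail loudly on empty results
            records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or not records[0]:
                raise ValueError(f"Data quality check failed: no results for {table}")
            if records[0][0] < 1:
                raise ValueError(f"Data quality check failed: {table} contains 0 rows")
            self.log.info("Data quality check passed: %s has %s rows", table, records[0][0])
```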