Data Pipelines with Airflow

A music streaming company, Sparkify, has decided that it is time to introduce more automation and monitoring to their data warehouse ETL pipelines. Apache Airflow is used to load and process data residing in S3 and to transfer it into an Amazon Redshift data warehouse on AWS.

DAGs

  • udac_example_dag.py: the main DAG
  • subdag.py: the sub-DAG
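
For orientation, here is a minimal sketch of how the main DAG and the sub-DAG might be wired together. The task ids, the default_args values, and the load_dimension_subdag factory are illustrative assumptions rather than the project's actual code, and the import paths follow Airflow 1.10 (they differ in Airflow 2.x).

    import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator      # Airflow 1.10 path
    from airflow.operators.subdag_operator import SubDagOperator    # Airflow 1.10 path

    default_args = {
        "owner": "sparkify",                             # illustrative values
        "start_date": datetime.datetime(2019, 1, 12),
        "retries": 3,
        "retry_delay": datetime.timedelta(minutes=5),
    }

    def load_dimension_subdag(parent_dag_name, task_id, args):
        # Hypothetical stand-in for the factory defined in subdag.py.
        # A sub-DAG's id must be "<parent_dag_id>.<task_id>".
        subdag = DAG(
            f"{parent_dag_name}.{task_id}",
            default_args=args,
            schedule_interval="@hourly",
        )
        DummyOperator(task_id="load_dimension_tables", dag=subdag)
        return subdag

    dag = DAG(
        "udac_example_dag",
        default_args=default_args,
        description="Load and transform data in Redshift with Airflow",
        schedule_interval="@hourly",
    )

    start = DummyOperator(task_id="Begin_execution", dag=dag)

    load_dimensions = SubDagOperator(
        subdag=load_dimension_subdag("udac_example_dag", "Load_dimensions", default_args),
        task_id="Load_dimensions",
        dag=dag,
    )

    end = DummyOperator(task_id="Stop_execution", dag=dag)

    start >> load_dimensions >> end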

Configuration

Airflow

Variables

  • udac_example_dag.append: possible values are True or False. If False, the dimension and fact tables are emptied before new rows are inserted; if True, new rows are appended to the existing tables.
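
A minimal sketch of how the DAG might read this variable; the default value and the way the flag is consumed are assumptions:

    from airflow.models import Variable

    # Airflow stores Variables as strings; fall back to "False" when the
    # Variable has not been created yet.
    append_data = Variable.get("udac_example_dag.append", default_var="False") == "True"

    # The resulting boolean would then be passed to the load tasks to choose
    # between a truncate-insert and an append-only insert.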

Connections

  • aws_credentials: AWS credentials used to access the Redshift cluster and the S3 bucket
  • redshift: connection to the Redshift cluster
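
Inside the operators, these two connections would typically be resolved through hooks, roughly as sketched below. The import paths are Airflow 1.10 style, and the staging table and S3 path are placeholders, not values from this project:

    from airflow.contrib.hooks.aws_hook import AwsHook    # Airflow 1.10 path
    from airflow.hooks.postgres_hook import PostgresHook

    # Resolve the credentials stored under the aws_credentials connection.
    credentials = AwsHook("aws_credentials").get_credentials()

    # Open a connection to the Redshift cluster and run a COPY from S3.
    redshift = PostgresHook(postgres_conn_id="redshift")
    redshift.run(f"""
        COPY staging_events                        -- placeholder table
        FROM 's3://your-bucket/log_data'           -- placeholder S3 path
        ACCESS_KEY_ID '{credentials.access_key}'
        SECRET_ACCESS_KEY '{credentials.secret_key}'
        FORMAT AS JSON 'auto';
    """)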

Prerequisites

  • An AWS account
  • A running Redshift cluster

Running

  1. Clone this repository
  2. Run docker-compose up
  3. Open http://localhost:8080/home
  4. Create the aws_credentials and redshift connections (Admin->Connections); see the sketch after this list for a programmatic alternative
  5. Activate and run the DAG udac_example_dag
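
Step 4 can also be done programmatically instead of through the UI. The snippet below is a sketch of that alternative, not part of the repository: it has to run where the Airflow metadata database is reachable (for example inside one of the Airflow containers), and all hosts and credentials are placeholders.

    from airflow import settings
    from airflow.models import Connection

    session = settings.Session()
    session.add(Connection(
        conn_id="aws_credentials",
        conn_type="aws",
        login="YOUR_AWS_ACCESS_KEY_ID",          # placeholder
        password="YOUR_AWS_SECRET_ACCESS_KEY",   # placeholder
    ))
    session.add(Connection(
        conn_id="redshift",
        conn_type="postgres",
        host="your-cluster.xxxxxx.us-west-2.redshift.amazonaws.com",  # placeholder endpoint
        schema="dev",                            # database name
        login="awsuser",                         # placeholder
        password="YOUR_PASSWORD",                # placeholder
        port=5439,
    ))
    session.commit()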