Project: Data Pipelines

Data Pipelines with Airflow / Data Engineer Nanodegree


Author: Jakub Pitera

This project demonstrates building a data pipeline with Apache Airflow. It defines a DAG that orchestrates an ETL pipeline from S3 to a Redshift database.

Pipeline steps:

  1. Create empty tables on Redshift
  2. Copy staging tables from S3 to Redshift
  3. Transform and insert data into fact and dimension tables
  4. Run quality checks
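
The four steps above can be sketched end to end in miniature. This is only an illustration, not the project's code: it uses Python's standard-library sqlite3 as a stand-in for Redshift, a plain insert in place of the S3 copy, and hypothetical table names, columns, and sample rows.

```python
import sqlite3

# Hypothetical sample rows, chosen only for illustration.
STAGING_ROWS = [
    ("u1", "Alice", "song_a"),
    ("u2", "Bob", "song_b"),
    ("u1", "Alice", "song_c"),
]


def run_pipeline(conn):
    """Run the four pipeline steps against an open database connection."""
    cur = conn.cursor()

    # Step 1: create empty tables (hypothetical schema).
    cur.execute("CREATE TABLE staging_events (user_id TEXT, name TEXT, song TEXT)")
    cur.execute("CREATE TABLE users (user_id TEXT PRIMARY KEY, name TEXT)")
    cur.execute("CREATE TABLE songplays (user_id TEXT, song TEXT)")

    # Step 2: load the staging table (on Redshift this would be a copy from S3).
    cur.executemany("INSERT INTO staging_events VALUES (?, ?, ?)", STAGING_ROWS)

    # Step 3: transform and insert into the fact and dimension tables.
    cur.execute("INSERT INTO songplays SELECT user_id, song FROM staging_events")
    cur.execute("INSERT INTO users SELECT DISTINCT user_id, name FROM staging_events")

    # Step 4: quality check, fail if any target table ended up empty.
    counts = {}
    for table in ("songplays", "users"):
        (count,) = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
        if count == 0:
            raise ValueError(f"Quality check failed: {table} is empty")
        counts[table] = count

    conn.commit()
    return counts


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_pipeline(connection))
    connection.close()
```

In the Airflow version each step would be its own task, so a failed quality check stops the run at that task and leaves the earlier steps visible for inspection.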

Files description:

  • dags/ - main script; defines the DAG, creates the tasks with the appropriate operators and parameters, and schedules the pipeline
  • create_tables.sql - SQL queries for dropping and creating tables (the data definition)
  • plugins/helpers/ - additional SQL queries used by the Airflow operators to insert data into tables
  • plugins/operators/ - custom Airflow operator that stages data from S3 to Redshift
  • plugins/operators/ - custom Airflow operator that inserts data from the staging table into the fact table on Redshift
  • plugins/operators/ - custom Airflow operator that inserts data from the staging table into the dimension tables on Redshift
  • plugins/operators/ - custom Airflow operator that runs quality checks on the data
  • - documentation
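
As a rough idea of what the staging operator has to produce, the sketch below builds a Redshift COPY statement from a few parameters. It is not taken from the project: the function name, its parameters, the JSON input format, the default region, and the example bucket and role are all assumptions made for illustration.

```python
def build_copy_sql(table, s3_bucket, s3_key, iam_role,
                   json_path="auto", region="us-west-2"):
    """Return a Redshift COPY statement for JSON data stored in S3.

    All parameter names and defaults here are hypothetical.
    """
    s3_path = f"s3://{s3_bucket}/{s3_key}"
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS JSON '{json_path}'\n"
        f"REGION '{region}';"
    )


if __name__ == "__main__":
    # Example values only; replace with a real bucket, key, and role ARN.
    print(build_copy_sql(
        table="staging_events",
        s3_bucket="example-bucket",
        s3_key="log_data",
        iam_role="arn:aws:iam::123456789012:role/example-role",
    ))
```

Keeping the statement in a small pure function makes it easy to check the generated SQL without a Redshift connection; an operator would then only need to execute the returned string.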