Project: Data Pipelines

Data Pipelines with Airflow / Data Engineer Nanodegree

Udacity

Author: Jakub Pitera


This project demonstrates building a data pipeline with Apache Airflow. It defines a DAG that orchestrates an ETL pipeline from S3 to a Redshift database.

Pipeline steps (see the DAG sketch below):

  1. Create empty tables on Redshift
  2. Copy staging tables from S3 to Redshift
  3. Transform and insert data into the fact and dimension tables
  4. Run data quality checks
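
A minimal sketch of how these steps could be wired together in dags/udac_example_dag.py, assuming Airflow 1.10-style imports. The operator class names are inferred from the plugin files listed below; the import path, connection IDs, bucket/table names, and constructor parameters are illustrative assumptions rather than the exact implementation:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

# Assumes the classes in plugins/operators/ are registered via the Airflow
# plugin system; the exact import path and class names may differ.
from airflow.operators import (StageToRedshiftOperator, LoadFactOperator,
                               LoadDimensionOperator, DataQualityOperator)

default_args = {
    "owner": "udacity",
    "start_date": datetime(2019, 1, 12),
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "depends_on_past": False,
}

dag = DAG(
    "udac_example_dag",
    default_args=default_args,
    description="Load and transform data in Redshift with Airflow",
    schedule_interval="@hourly",
)

# 1. Create empty tables on Redshift (path assumed to be on the template search path)
create_tables = PostgresOperator(
    task_id="Create_tables",
    dag=dag,
    postgres_conn_id="redshift",
    sql="create_tables.sql",
)

# 2. Copy staging tables from S3 to Redshift (one task per staging table)
stage_events = StageToRedshiftOperator(
    task_id="Stage_events",
    dag=dag,
    redshift_conn_id="redshift",           # parameter names are assumptions
    aws_credentials_id="aws_credentials",
    table="staging_events",
    s3_bucket="udacity-dend",
    s3_key="log_data",
)

# 3. Transform and insert into fact and dimension tables
load_songplays = LoadFactOperator(
    task_id="Load_songplays_fact_table",
    dag=dag,
    redshift_conn_id="redshift",
    table="songplays",
)

load_users = LoadDimensionOperator(
    task_id="Load_user_dim_table",
    dag=dag,
    redshift_conn_id="redshift",
    table="users",
)

# 4. Run data quality checks
quality_checks = DataQualityOperator(
    task_id="Run_data_quality_checks",
    dag=dag,
    redshift_conn_id="redshift",
    tables=["songplays", "users"],
)

# Task dependencies mirror the four pipeline steps above
create_tables >> stage_events >> load_songplays
load_songplays >> load_users >> quality_checks
```

The `>>` operator sets downstream dependencies, so each step only starts after the previous one succeeds.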

File descriptions:

  • dags/udac_example_dag.py - main script: defines the DAG, creates tasks with the appropriate operators and parameters, and schedules the pipeline
  • create_tables.sql - SQL queries for dropping and creating tables (data definition)
  • plugins/helpers/sql_queries.py - additional queries used by the Airflow operators for inserting data into tables
  • plugins/operators/stage_redshift.py - custom Airflow operator for staging data from S3 to Redshift
  • plugins/operators/load_fact.py - custom Airflow operator that inserts data from the staging tables into the fact table on Redshift
  • plugins/operators/load_dimension.py - custom Airflow operator that inserts data from the staging tables into the dimension tables on Redshift
  • plugins/operators/data_quality.py - custom Airflow operator that runs data quality checks (a sketch of this operator pattern follows the list)
  • README.md - documentation
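
To illustrate the custom operator pattern shared by the files in plugins/operators/, here is a hedged sketch of a data quality operator as an Airflow 1.x BaseOperator subclass. The constructor arguments and the specific row-count check are assumptions and may differ from the actual data_quality.py:

```python
# Hypothetical sketch; the real plugins/operators/data_quality.py may use
# different parameters and checks.
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class DataQualityOperator(BaseOperator):
    ui_color = "#89DA59"

    @apply_defaults
    def __init__(self, redshift_conn_id="", tables=None, *args, **kwargs):
        super(DataQualityOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            # Simple check: fail the task if the table is empty
            records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or not records[0] or records[0][0] < 1:
                raise ValueError(f"Data quality check failed: {table} returned no rows")
            self.log.info("Data quality check on table %s passed with %s records",
                          table, records[0][0])
```

Raising an exception inside execute() marks the task as failed, so Airflow's retry and alerting behaviour applies to failed quality checks.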