nyc_taxis_pipeline

End-to-end simple data pipeline integrating PySpark, Airflow, Postgres and Metabase with Docker Compose



The idea behind this project was to integrate my acquired knowledge of Airflow, Postgres, PySpark, Metabase and Docker into a single, simple end-to-end project.

For the architecture behind this I cloned Cordon Thiago's GitHub repo, but modified it towards my goal:

- removed the Jupyter container
- reduced the cluster to just 2 Spark worker nodes
- added a Metabase container
- added a shared volume for spark-warehouse

Architecture components

Data Pipeline

1. Downloads taxi data from the official NYC repository (download DAG)
2. Consolidates the Parquet files into a PySpark temp view
3. Writes the results of simple queries against the view into Postgres tables (sketched below)
4. Serves a Metabase dashboard built from the aggregated Postgres tables
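
Steps 2 and 3 boil down to a read, a SQL aggregation, and a JDBC write. A minimal sketch of what such a Spark job might look like; the paths, query, table name and credentials here are illustrative assumptions, not the repo's actual values:

```python
from pyspark.sql import SparkSession

# Illustrative values only; the real DAG wires these in from its own config.
PARQUET_DIR = "/usr/local/spark/resources/data/*.parquet"  # assumed shared volume path
PG_URL = "jdbc:postgresql://postgres:5432/airflow"         # assumed service name and db

spark = SparkSession.builder.appName("nyc-taxis-aggregation").getOrCreate()

# Step 2: consolidate the downloaded Parquet files into a single temp view
spark.read.parquet(PARQUET_DIR).createOrReplaceTempView("taxis")

# Step 3: run a simple aggregation against the view...
agg = spark.sql("""
    SELECT payment_type, COUNT(*) AS trips, AVG(total_amount) AS avg_amount
    FROM taxis
    GROUP BY payment_type
""")

# ...and write the result to a Postgres table in the gold schema
(agg.write
    .format("jdbc")
    .option("url", PG_URL)
    .option("dbtable", "gold.trips_by_payment_type")  # assumed table name
    .option("user", "airflow")                        # assumed credentials
    .option("password", "airflow")
    .option("driver", "org.postgresql.Driver")
    .mode("overwrite")
    .save())

spark.stop()
```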

Steps to run this project locally

Build the airflow-spark driver image


Inside taxis-project/docker/docker-airflow, run (the image tag encodes the Airflow and Spark versions):

$ docker build --rm --force-rm -t docker-airflow-spark:1.10.7_3.1.2 .

Start containers


Navigate to airflow-spark/docker and run:

$ docker-compose up -d

Note: when running docker-compose for the first time, the images postgres:9.6, bitnami/spark:3.1.2 and metabase/metabase will be downloaded before the containers start.
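
You can confirm that all services came up with:

$ docker-compose ps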

Check that you can access:


Airflow: http://localhost:8282

Spark Master: http://localhost:8181

Set the spark_default and postgres_default connections


Before starting the DAGs, the spark_default and postgres_default connections must be edited to match the hosts, ports and credentials in the Docker Compose YAML file.
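
Connections can be edited in the Airflow UI under Admin > Connections, or from inside the webserver container via the Airflow 1.10 CLI. A sketch for the Spark connection, assuming the container is named airflow-webserver and the master is reachable at spark://spark:7077 (check your compose file for the actual values):

$ docker exec -it airflow-webserver airflow connections --delete --conn_id spark_default
$ docker exec -it airflow-webserver airflow connections --add --conn_id spark_default \
    --conn_type spark --conn_host spark://spark --conn_port 7077

postgres_default is handled the same way, using the Postgres host, port and credentials from the compose file.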

Run Airflow DAGs


Now that the connections are set, we can start both DAGs from the Airflow webserver UI.
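
For context, the Spark DAG is essentially a SparkSubmitOperator pointed at the aggregation job. A hedged sketch (the DAG id, paths and JDBC driver jar below are assumptions):

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

# Illustrative DAG id, schedule and paths; the real DAG defines its own.
with DAG(
    dag_id="spark_taxis_aggregation",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:

    submit_job = SparkSubmitOperator(
        task_id="submit_aggregation_job",
        application="/usr/local/spark/app/aggregate_taxis.py",    # assumed script path
        conn_id="spark_default",                                  # connection set above
        jars="/usr/local/spark/resources/postgresql-42.2.6.jar",  # assumed JDBC driver jar
    )
```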

Set up the Metabase dashboard


Go to Metabase: http://localhost:3001

Open the settings at the top right corner and add the Postgres database so Metabase can sync with it.

Once the DB is synced, you can access the data in the available tables in the gold schema to build visualizations like this one.