GBQ PIPELINE WITH AIRFLOW DOCKER

INTRO

In this project, Airflow is used to build a pipeline that leverages public datasets on BigQuery and updates aggregated tables on a daily basis, which feed into a dashboard on Data Studio.

SETUP

Prerequisites

  • Docker Desktop
  • BigQuery account (sandbox)

BigQuery

  • This project leverages two public BigQuery datasets: bigquery-public-data.hacker_news and githubarchive.day (a sample query against them is sketched after this list)
  • For billing, the BigQuery sandbox can be used, which includes 10 GB of storage and 1 TB of query processing per month free of charge
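
Both datasets can be explored directly from Python with the google-cloud-bigquery client before wiring anything into Airflow. A minimal sketch, assuming the standard layout of the public hacker_news.full table (the tables and fields the pipeline actually uses live in ./dags/gbq_pipeline.py):

    # sample_query.py -- a minimal exploration sketch; the table and field
    # names are assumptions based on the public dataset, not this repo's code.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS

    sql = """
        SELECT DATE(timestamp) AS day, COUNT(*) AS stories
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = 'story'
        GROUP BY day
        ORDER BY day DESC
        LIMIT 7
    """
    for row in client.query(sql).result():
        print(row.day, row.stories)

A query like this also gives a feel for how much of the monthly 1 TB sandbox quota a full-table scan consumes.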

Docker

  • Docker compose filepath: ./docker-compose.yml
  • Airflow image: apache/airflow:2.0.1 (with Flower off and the default examples off; see the note after this list)
  • Redis image: redis:latest
  • Postgres image: postgres:13
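
The compose setup matches the layout of the official Airflow 2.0.1 template (an assumption; the repo's ./docker-compose.yml is authoritative), which can be fetched with:

> curl -LfO https://airflow.apache.org/docs/apache-airflow/2.0.1/docker-compose.yaml

Starting from that template, turning the bundled examples off means setting AIRFLOW__CORE__LOAD_EXAMPLES to 'false' in the shared environment block, and turning Flower off means removing or commenting out its service entry.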

AIRFLOW

DAG design

  • DAG filepath: ./dags/gbq_pipeline.py (a sketch of what such a DAG can look like follows)
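
The actual DAG lives at ./dags/gbq_pipeline.py; the sketch below only illustrates the overall shape under some assumptions: a daily schedule, one aggregation task per public dataset via the Google provider's BigQueryExecuteQueryOperator, and a final join into github_hackernews_join. The project and dataset names (my_project.my_dataset) and the SQL itself are placeholders, not the repo's real code.

    # gbq_pipeline.py -- a minimal sketch, not the repo's actual DAG;
    # my_project.my_dataset and the SQL are illustrative assumptions.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import (
        BigQueryExecuteQueryOperator,
    )

    with DAG(
        dag_id="gbq_pipeline",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",  # aggregated tables refresh daily
        catchup=False,
    ) as dag:
        # Recompute daily Hacker News story counts into a staging table.
        agg_hackernews = BigQueryExecuteQueryOperator(
            task_id="agg_hackernews",
            sql="""
                SELECT DATE(timestamp) AS day, COUNT(*) AS stories
                FROM `bigquery-public-data.hacker_news.full`
                WHERE type = 'story'
                GROUP BY day
            """,
            destination_dataset_table="my_project.my_dataset.hackernews_daily",
            write_disposition="WRITE_TRUNCATE",
            use_legacy_sql=False,
        )

        # Aggregate one daily githubarchive shard; {{ ds_nodash }} is
        # Airflow's templated run date (YYYYMMDD), matching the shard names.
        agg_github = BigQueryExecuteQueryOperator(
            task_id="agg_github",
            sql="""
                SELECT DATE(created_at) AS day, COUNT(*) AS events
                FROM `githubarchive.day.{{ ds_nodash }}`
                GROUP BY day
            """,
            destination_dataset_table="my_project.my_dataset.github_daily",
            write_disposition="WRITE_APPEND",
            use_legacy_sql=False,
        )

        # Join the two aggregates into the table behind the dashboard.
        join_tables = BigQueryExecuteQueryOperator(
            task_id="join_tables",
            sql="""
                SELECT h.day, h.stories, g.events
                FROM `my_project.my_dataset.hackernews_daily` AS h
                JOIN `my_project.my_dataset.github_daily` AS g USING (day)
            """,
            destination_dataset_table="my_project.my_dataset.github_hackernews_join",
            write_disposition="WRITE_TRUNCATE",
            use_legacy_sql=False,
        )

        [agg_hackernews, agg_github] >> join_tables

Fanning the two aggregation tasks into the join keeps them independent, so a failure in one dataset's scan does not block the other from being retried.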

Running

> docker-compose up airflow-init
> docker-compose up -d
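
The first command runs the one-off airflow-init service, which applies the database migrations and, in the official template, creates a default airflow/airflow login before exiting; the second brings up the scheduler, workers, and webserver in the background. Assuming the template's defaults are kept, the web UI is then reachable at http://localhost:8080, where the gbq_pipeline DAG can be enabled and triggered.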

  • Once the DAG has run, the tables are created in the GBQ project, and the final join table github_hackernews_join is populated with data (a quick verification sketch follows)
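
One quick way to confirm the load, sketched with the google-cloud-bigquery client; the table path is an assumption and should point at the sandbox project and dataset the DAG actually writes to:

    # verify_join.py -- a minimal check; replace my_project.my_dataset with
    # the project/dataset used by the DAG.
    from google.cloud import bigquery

    client = bigquery.Client()
    table = client.get_table("my_project.my_dataset.github_hackernews_join")
    print(f"{table.num_rows} rows in {table.full_table_id}")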

DASHBOARD