DTC Data Engineering ZoomCamp

Proposed Architecture on GCP

Tech Stack

  • Google Cloud Platform
  • Terraform
  • Docker
  • SQL
  • Airflow
  • dbt
  • Spark
  • Kafka

Week 01

Intro and Prerequisites

  • Setting up the Environment
  • Google Cloud Account
    1. Docker
    2. Terraform
  • Running Postgres in Docker
  • Taking a look at the NY Taxi dataset
  • SQL refresher

Week 02

Ingestion and Orchestration

  • Data Lake
    1. What is a Data Lake
    2. ETL vs ELT
    3. Using GCS
  • Orchestration
    1. What is an Orchestration Pipeline
    2. Data Ingestion
    3. Introducing & Using Airflow
  • Demo
    1. Setting up Airflow with Docker
    2. Data Ingestion DAG
      • Extraction
      • Pre-processing (parquet, partitioning)
      • Loading
      • Exploration with BigQuery
  • Best Practices

Week 03

Data Warehouse

  • What is a Data Warehouse?
  • BigQuery
    1. Partitioning and Clustering
    2. With Airflow
    3. Best Practices

Week 04

Analytics Engineering

  • What is dbt and how does it fit the tech stack?
  • Using dbt:
    1. Anatomy of a dbt model
    2. Seeds
    3. Jinja, macros, and tests
    4. Documentation
    5. Packages
  • Build a dashboard in Google Data Studio

Week 05

Batch Processing

  • Spark internals
  • Broadcasting
  • Partitioning
  • Shuffling
  • Spark + Airflow
  • Apache Flink as an alternative

Week 06

Stream Processing

  • Basics of Kafka
  • Consumer-Producer
  • Kafka Streams
  • Kafka Connect