zoomcamp-data-engineering

This repository documents my journey through data engineering. It collects the resources, notes, projects, and practical examples that helped me build a solid foundation in data engineering concepts, tools, and best practices.

Primary language: Jupyter Notebook

Syllabus

Week 1: Introduction & Prerequisites

  • Course overview
  • Introduction to GCP
  • Docker and docker-compose
  • Running Postgres locally with Docker
  • Setting up infrastructure on GCP with Terraform
  • Preparing the environment for the course
  • Homework

Week 2: Workflow Orchestration

  • Data Lake
  • Workflow orchestration
  • Introduction to Prefect
  • ETL with GCP & Prefect
  • Parametrizing workflows
  • Prefect Cloud and additional resources
  • Homework
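
To make "parametrizing workflows" concrete, here is a minimal sketch in plain Python. It does not use the Prefect API. The names `RunParams`, `etl_flow`, the dataset naming pattern, and the `example-bucket` bucket are all hypothetical, and no data is actually downloaded or uploaded. The point is only that one flow definition can be reused for many runs by passing different parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunParams:
    """Parameters for a single ETL run (hypothetical names)."""
    color: str
    year: int
    month: int


def extract(params: RunParams) -> str:
    # Build a dataset name from the parameters; a real task would download it.
    return f"{params.color}_tripdata_{params.year}-{params.month:02d}"


def transform(dataset_name: str) -> str:
    # Stand-in for cleaning the data and writing it out as Parquet.
    return dataset_name + ".parquet"


def load(filename: str, bucket: str = "example-bucket") -> str:
    # Stand-in for an upload; returns the target path in a hypothetical bucket.
    return f"gs://{bucket}/data/{filename}"


def etl_flow(color: str = "yellow", year: int = 2021, months=(1,)) -> list:
    """Run the same extract/transform/load steps once per requested month."""
    return [
        load(transform(extract(RunParams(color, year, month))))
        for month in months
    ]
```

Calling `etl_flow("green", 2020, (1, 11))` returns one target path per month, so backfilling more months only means changing the arguments, not the flow.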

Week 3: Data Warehouse

  • Data Warehouse
  • BigQuery
  • Partitioning and clustering
  • BigQuery best practices
  • Internals of BigQuery
  • Integrating BigQuery with Airflow
  • BigQuery Machine Learning
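
As a rough intuition for partitioning and clustering, the sketch below models a table in plain Python. It is not how BigQuery is implemented. The sample rows and the function names `build_table` and `total_fare` are made up for illustration. Partitioning splits rows by date so a query on one date never touches the others, and clustering keeps each partition sorted by a key so matching rows sit next to each other.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Hypothetical trip records: (pickup_date, vendor_id, fare)
ROWS = [
    ("2021-01-01", 2, 12.5),
    ("2021-01-01", 1, 7.0),
    ("2021-01-02", 1, 30.0),
    ("2021-01-02", 2, 9.5),
    ("2021-01-02", 1, 4.0),
]


def build_table(rows):
    """Partition rows by date, then cluster (sort) each partition by vendor_id."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)
    for part in partitions.values():
        part.sort(key=lambda r: r[1])
    return dict(partitions)


def total_fare(table, date, vendor_id):
    """Return (sum of fares, number of matching rows) for one date and vendor."""
    part = table.get(date, [])  # partition pruning: other dates are never read
    # The key list is rebuilt here for simplicity; a real engine keeps metadata.
    keys = [r[1] for r in part]
    lo = bisect_left(keys, vendor_id)   # clustering: matching rows are contiguous
    hi = bisect_right(keys, vendor_id)
    matched = part[lo:hi]
    return sum(r[2] for r in matched), len(matched)
```

With `table = build_table(ROWS)`, the call `total_fare(table, "2021-01-02", 1)` only looks inside the `2021-01-02` partition and reads the two adjacent rows for vendor 1.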

Week 4: Analytics Engineering

  • Basics of analytics engineering
  • dbt (data build tool)
  • BigQuery and dbt
  • Postgres and dbt
  • dbt models
  • Testing and documenting
  • Deployment to the cloud and locally
  • Visualizing the data with Google Data Studio and Metabase

Week 5: Batch Processing

  • Batch processing
  • What is Spark
  • Spark DataFrames
  • Spark SQL
  • Internals: GroupBy and joins
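
The idea behind "Internals: GroupBy and joins" can be sketched without Spark. The code below is a simplified model in plain Python, not Spark's actual implementation. The function names `group_by_sum` and `hash_join` and the `num_reducers` parameter are assumptions for illustration. The group-by runs in two stages (aggregate inside each partition, then shuffle by key and merge), and the join builds a hash index on one side and probes it with the other.

```python
from collections import defaultdict


def group_by_sum(partitions, num_reducers=2):
    """Two-stage group-by over a list of partitions of (key, value) pairs."""
    # Stage 1: partial aggregation inside each input partition.
    partials = []
    for part in partitions:
        acc = defaultdict(float)
        for key, value in part:
            acc[key] += value
        partials.append(acc)

    # Shuffle: route every key to exactly one reducer bucket by its hash.
    buckets = [defaultdict(float) for _ in range(num_reducers)]
    for acc in partials:
        for key, value in acc.items():
            buckets[hash(key) % num_reducers][key] += value

    # Stage 2: each bucket now holds the final totals for its own keys.
    result = {}
    for bucket in buckets:
        result.update(bucket)
    return result


def hash_join(left, right):
    """Inner join of two lists of (key, value) pairs on the key."""
    index = defaultdict(list)  # build side: index the right input by key
    for key, value in right:
        index[key].append(value)
    # Probe side: look up each left row in the index.
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]
```

Because every key lands in exactly one bucket after the shuffle, the per-bucket totals can be combined without any further merging.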

Week 6: Streaming

  • Introduction to Kafka
  • Schemas (Avro)
  • Kafka Streams
  • Kafka Connect and KSQL

Week 7, 8 & 9: Project

Putting everything we learned into practice

  • Week 7 and 8: working on your project
  • Week 9: reviewing your peers

Technologies

  • Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
    • Google Cloud Storage (GCS): Data Lake
    • BigQuery: Data Warehouse
  • Terraform: Infrastructure-as-Code (IaC)
  • Docker: Containerization
  • SQL: Data Analysis & Exploration
  • Prefect: Workflow Orchestration
  • dbt: Data Transformation
  • Spark: Distributed Processing
  • Kafka: Streaming