zoomcamp-data-engineering

This repository documents my journey through data engineering. It collects the resources, notes, projects, and practical examples that helped me build a solid foundation in data engineering concepts, tools, and best practices.

Syllabus

Week 1: Introduction & Prerequisites

Course overview
Introduction to GCP
Docker and docker-compose
Running Postgres locally with Docker
Setting up infrastructure on GCP with Terraform
Preparing the environment for the course
Homework

Week 2: Workflow Orchestration

Data Lake
Workflow orchestration
Introduction to Prefect
ETL with GCP & Prefect
Parametrizing workflows
Prefect Cloud and additional resources
Homework

Week 3: Data Warehouse

Data Warehouse
BigQuery
Partitioning and clustering
BigQuery best practices
Internals of BigQuery
Integrating BigQuery with Airflow
BigQuery Machine Learning

Week 4: Analytics engineering

Basics of analytics engineering
dbt (data build tool)
BigQuery and dbt
Postgres and dbt
dbt models
Testing and documenting
Deployment to the cloud and locally
Visualizing the data with Google Data Studio and Metabase

Week 5: Batch processing

Batch processing
What is Spark
Spark Dataframes
Spark SQL
Internals: GroupBy and joins

Week 6: Streaming

Introduction to Kafka
Schemas (Avro)
Kafka Streams
Kafka Connect and KSQL

Week 7, 8 & 9: Project

Putting everything we learned into practice

Week 7 and 8: working on your project
Week 9: reviewing your peers

Technologies

Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
- Google Cloud Storage (GCS): Data Lake
- BigQuery: Data Warehouse
Terraform: Infrastructure-as-Code (IaC)
Docker: Containerization
SQL: Data Analysis & Exploration
Prefect: Workflow Orchestration
dbt: Data Transformation
Spark: Distributed Processing
Kafka: Streaming

hamzajakouk/zoomcamp-data-engineering