Data Engineering Zoomcamp

Register in DataTalks.Club's Slack
Join the #course-data-engineering channel
Join the course Telegram channel with announcements
The videos are published on DataTalks.Club's YouTube channel in the course playlist
Frequently asked technical questions

Syllabus

Module 1: Containerization and Infrastructure as Code
Module 2: Workflow Orchestration
Workshop 1: Data Ingestion
Module 3: Data Warehouse
Module 4: Analytics Engineering
Module 5: Batch processing
Module 6: Streaming
Workshop 2: Stream Processing with SQL
Project

Taking the course

2024 Cohort

Start: 15 January 2024 (Monday) at 17:00 CET
Registration link: https://airtable.com/shr6oVXeQvSI5HuWD
Cohort folder with homeworks and deadlines

Self-paced mode

All course materials are freely available, so you can take the course at your own pace.

Follow the suggested syllabus (see below) week by week
You don't need to fill in the registration form. Just start watching the videos and join Slack
Check the FAQ if you have problems
If you can't find a solution to your problem in the FAQ, ask for help in Slack

Syllabus

Note: NYC TLC changed the format of the data we use to Parquet. In the course, we still use the CSV files accessible here.

Module 1: Containerization and Infrastructure as Code

Course overview
Introduction to GCP
Docker and docker-compose
Running Postgres locally with Docker
Setting up infrastructure on GCP with Terraform
Preparing the environment for the course
Homework

More details

Module 2: Workflow Orchestration

Data Lake
Workflow orchestration
Workflow orchestration with Mage
Homework

More details

Module 3: Data Warehouse

Data Warehouse
BigQuery
Partitioning and clustering
BigQuery best practices
Internals of BigQuery
Integrating BigQuery with Airflow
BigQuery Machine Learning

More details

Module 4: Analytics Engineering

Basics of analytics engineering
dbt (data build tool)
BigQuery and dbt
Postgres and dbt
dbt models
Testing and documenting
Deployment to the cloud and locally
Visualizing the data with Google Data Studio and Metabase

More details

Module 5: Batch processing

Batch processing
What is Spark
Spark Dataframes
Spark SQL
Internals: GroupBy and joins

More details

Module 6: Streaming

Introduction to Kafka
Schemas (Avro)
Kafka Streams
Kafka Connect and KSQL

More details

Workshop 2: Stream Processing with SQL

More details

Project

Putting everything we learned into practice

Weeks 1 and 2: working on your project
Week 3: reviewing your peers

More details

Overview

Prerequisites

To get the most out of this course, you should feel comfortable with coding and the command line, and know the basics of SQL. Prior experience with Python will be helpful, but you can pick up Python relatively quickly if you have experience with other programming languages.

Prior experience with data engineering is not required.

Instructors

Past instructors:

Course UI

Alternatively, you can access this course using the provided UI app, which offers a user-friendly interface for navigating the course material.