This is my collection of notes and code from following the DataTalks Data Engineering Zoomcamp.
Project's timeline
Module | Start Date | Homework Due | Weeks to complete | Videos | Duration | Notes |
---|---|---|---|---|---|---|
1. Introduction & Prerequisites | 15 Jan | 25 Jan | 2 | x9 | 2h 50m | |
2. Workflow Orchestration | 29 Jan | 05 Feb | 1 | x11 | 1h 32m | |
3. Data Warehouse | 05 Feb | 12 Feb | 1 | x6 | 1h 01m | |
dlt workshop | 05 Feb | 15 Feb | 1.5 | x1 | 1h 20m | |
4. Analytics Engineering | 15 Feb | 22 Feb | 1 | x10 | 2h 41m | |
5. Batch processing | 22 Feb | 04 Mar | 1.5 | | | |
6. Streaming | 04 Mar | 15 Mar | 1.5 | | | |
RisingWave workshop | 04 Mar | 18 Mar | n/a | | | |
Project (attempt 1) | 18 Mar | 01 Apr | 2 | | | |
Project evaluation (attempt 1) | 01 Apr | 08 Apr | 1 | | | |
Project (attempt 2) | 01 Apr | 15 Apr | 2 | | | |
Project evaluation (attempt 2) | 15 Apr | 29 Apr | 1 | | | |
Here is a checklist of what you need:
- Set up virtual environment for python development
- Install Docker Desktop
- Get Google Cloud account
- Install Terraform (you can follow the docs, or like me, install it in a conda environment)
I use mamba to manage my virtual environments; see env.yaml for the requirements (this will be updated as I move through the course).
Setting up Docker with Windows 11 and WSL is very easy. Assuming WSL is already installed, install Docker Desktop on Windows. To enable the docker CLI on your distro of choice within WSL, just adjust the settings in Docker Desktop:
- Settings > Resources > WSL integration
- Select the distros where you want to be able to use `docker` commands.
This section will cover Docker, running Postgres and pgAdmin containers, some SQL basics, and setting up cloud resources in Google Cloud using Terraform. A rough sketch of the ingestion script is included after the video lists below.
- 1: Introduction to Docker
- 2: Ingesting NY Taxi Data to Postgres
- 3: Connecting pgAdmin and Postgres
- 4: Dockerizing the Ingestion Script
- 5: Running Postgres and pgAdmin with Docker-Compose
- 6: SQL Refresher
- 7: Terraform Primer
- 8: Terraform Basics
- 9: Terraform Variables
Bonus videos:
- Setting up the Environment on Google Cloud
- Using Github Codespaces for the Course
- Port Mapping and Networks in Docker
- Optional (if you have issues with pgcli): Connecting to Postgres with Jupyter and Pandas
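The heart of this module is the ingestion script (videos 2 and 4 above). Here is a rough sketch of what such a script can look like, assuming a local Postgres container and a Postgres driver such as psycopg2 installed; the connection string, file name, and table name are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for the local Postgres container
engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# Read the CSV in chunks so a large file never has to fit in memory at once
df_iter = pd.read_csv(
    "yellow_tripdata_2021-01.csv.gz",  # placeholder file name
    iterator=True,
    chunksize=100_000,
)

for df in df_iter:
    # Parse the timestamp columns before loading
    df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
    df["tpep_dropoff_datetime"] = pd.to_datetime(df["tpep_dropoff_datetime"])
    # Append each chunk to the target table
    df.to_sql("yellow_taxi_data", con=engine, if_exists="append", index=False)
```

Dockerizing it then mostly means wrapping the script in an image and passing the connection details and file URL as arguments.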
This section covers workflow orchestration with Mage; a sketch of a Mage data loader block follows the video list below.
- 1: What is Orchestration?
- 2: What is Mage?
- 3: Configure Mage
- 4: A Simple Pipeline
- 5: Configuring Postgres
- 6: API to Postgres
- 7: Configuring GCP
- 8: ETL: API to GCS
- 9: ETL: GCS to BigQuery
- 10: Parameterized Execution
- 11: Backfills
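To make the pipeline videos more concrete, here is a minimal data loader block in the style of Mage's generated templates; the URL and function name are placeholders:

```python
import io

import pandas as pd
import requests

# Mage injects these decorators into the block's scope at runtime
if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_taxi_data(*args, **kwargs):
    """Fetch a CSV over HTTP and return it as a DataFrame."""
    url = 'https://example.com/yellow_tripdata_2021-01.csv'  # placeholder URL
    response = requests.get(url)
    response.raise_for_status()
    return pd.read_csv(io.StringIO(response.text))


@test
def test_output(output, *args) -> None:
    assert output is not None, 'The output is undefined'
```

Transformer and exporter blocks follow the same pattern with their own decorators.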
Deployment videos (they say optional, but this is pretty crucial for me):
- Deployment Prerequisites
- Google Cloud Permissions
- Deploying to Google Cloud Part 1
- There seems to be a missing video here. See notes for details on how to deploy using Terraform.
- Deploying to Google Cloud Part 2
- Next Steps
Office hours recording here.
In this section we will talk about data warehousing in general and use Google BigQuery as an example; a sketch of creating a partitioned and clustered table comes at the end of this section.
- 1: Data Warehouse and BigQuery
- 2: Partitioning and Clustering
- 3: BigQuery Best Practices
- 4: Internals of BigQuery
- 5: BigQuery Machine Learning
- 6: BigQuery Machine Learning Deployment
Optional video (but watch this first if, like me, you still don't have the full green and yellow taxi data in GCP or a local Postgres db): Hack for loading data to BigQuery
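As a reminder of what the partitioning and clustering video is driving at, here is a small sketch using the google-cloud-bigquery client; the project, dataset, and table names are placeholders:

```python
from google.cloud import bigquery

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service account key
client = bigquery.Client()

# Placeholder project, dataset, and table names
ddl = """
CREATE OR REPLACE TABLE `my-project.ny_taxi.yellow_partitioned_clustered`
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM `my-project.ny_taxi.yellow_tripdata`;
"""

# Run the DDL and wait for the job to finish
client.query(ddl).result()
```

Partitioning on the pickup date lets BigQuery prune whole partitions at query time, while clustering on VendorID sorts the data within each partition.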
- 1: Analytics Engineering Basics
- 2: What is dbt?
- Start Your dbt Project
- 5: Build the First dbt Models
- 6: Testing and Documenting the Project
- Deployment using
- Visualising the data with
- 1: Introduction to Batch Processing
- 2: Introduction to Spark
- 3: First Look at Spark/PySpark
- 4: Spark Dataframes
- 5: SQL with Spark
- 6: Anatomy of a Spark Cluster
- 7: GroupBy in Spark
- 8: Joins in Spark
Optional:
- Installing Spark (Linux)
- Preparing Yellow and Green Taxi Data
- Resilient Distributed Datasets
- Operations on Spark RDDs
- Spark RDD mapPartition
- Running Spark in the Cloud
- Connecting to Google Cloud Storage
- Creating a Local Spark Cluster
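The GroupBy and Joins videos boil down to patterns like this PySpark sketch; the parquet paths assume the green and yellow taxi data have already been prepared, so treat them as placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local Spark session, as used throughout the module
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("de-zoomcamp")
    .getOrCreate()
)

# Placeholder parquet paths
df_green = spark.read.parquet("data/pq/green/*/*")
df_yellow = spark.read.parquet("data/pq/yellow/*/*")

# GroupBy: revenue per pickup zone and hour for each dataset
green_rev = (
    df_green
    .withColumn("hour", F.date_trunc("hour", "lpep_pickup_datetime"))
    .groupBy("hour", "PULocationID")
    .agg(F.sum("total_amount").alias("green_amount"))
)
yellow_rev = (
    df_yellow
    .withColumn("hour", F.date_trunc("hour", "tpep_pickup_datetime"))
    .groupBy("hour", "PULocationID")
    .agg(F.sum("total_amount").alias("yellow_amount"))
)

# Outer join the two aggregates on the grouping keys
revenue = green_rev.join(yellow_rev, on=["hour", "PULocationID"], how="outer")
revenue.show(5)
```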
The workshop quickly covers how to build data ingestion pipelines using dlt; a minimal pipeline sketch follows the list below. It includes:
- Extracting data from APIs or files
- Normalizing and loading data
- Incremental loading
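A minimal dlt pipeline along those lines might look like the sketch below; the API endpoint is made up, and DuckDB is just a convenient local destination:

```python
import dlt
import requests


@dlt.resource(table_name="rides", write_disposition="append")
def taxi_rides():
    # Placeholder API endpoint; yield records so dlt can normalize them
    url = "https://example.com/api/rides"
    response = requests.get(url)
    response.raise_for_status()
    yield from response.json()


# DuckDB keeps the example local; swap the destination for BigQuery etc.
pipeline = dlt.pipeline(
    pipeline_name="taxi_pipeline",
    destination="duckdb",
    dataset_name="taxi_data",
)

load_info = pipeline.run(taxi_rides())
print(load_info)
```

Incremental loading then mainly comes down to telling dlt which cursor field to track between runs.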