This is my collection of notes and code from following the DataTalks Data Engineering Zoomcamp.
Project's timeline
Module | Start Date | Homework Due | Weeks to complete | Videos | Duration | Notes |
---|---|---|---|---|---|---|
1. Introduction & Prerequisites | 15 Jan | 25 Jan | 2 | x9 | 2h 50m | |
2. Workflow Orchestration | 29 Jan | 05 Feb | 1 | x11 | 1h 32m | |
3. Data Warehouse | 05 Feb | 12 Feb | 1 | x6 | 1h 01m | |
dlt workshop | 05 Feb | 15 Feb | 1.5 | x1 | 1h 20m | |
4. Analytics Engineering | 15 Feb | 22 Feb | 1 | x10 | 2h 41m | |
5. Batch processing | 22 Feb | 04 Mar | 1.5 | | | |
6. Streaming | 04 Mar | 15 Mar | 1.5 | | | |
RisingWave workshop | 04 Mar | 18 Mar | n/a | | | |
Project (attempt 1) | 18 Mar | 01 Apr | 2 | | | |
Project evaluation (attempt 1) | 01 Apr | 08 Apr | 1 | | | |
Project (attempt 2) | 01 Apr | 15 Apr | 2 | | | |
Project evaluation (attempt 2) | 15 Apr | 29 Apr | 1 | | | |
Here is a checklist of what you need:
- Set up virtual environment for python development
- Install Docker Desktop
- Get Google Cloud account
- Install Terraform (you can follow the docs, or like me, install it in a conda environment)
I use mamba to manage my virtual environments; see env.yaml for the requirements (this will be updated as I move through the course).
Setting up Docker with Windows 11 and WSL is very easy. Assuming WSL is already installed, install Docker Desktop on Windows. To enable the docker CLI on your distro of choice within WSL, just adjust the settings in Docker Desktop:
- Settings > Resources > WSL integration
- Select the distros where you want to be able to use `docker` commands.
This section will cover Docker, running Postgres and pgAdmin containers, some SQL basics, and setting up cloud resources in Google Cloud using Terraform. A rough sketch of the ingestion script is included after the video lists below.
- 1: Introduction to Docker
- 2: Ingesting NY Taxi Data to Postgres
- 3: Connecting pgAdmin and Postgres
- 4: Dockerizing the Ingestion Script
- 5: Running Postgres and pgAdmin with Docker-Compose
- 6: SQL Refresher
- 7: Terraform Primer
- 8: Terraform Basics
- 9: Terraform Variables
Bonus videos:
- Setting up the Environment on Google Cloud
- Using Github Codespaces for the Course
- Port Mapping and Networks in Docker
- Optional (if you have issues with pgcli): Connecting to Postgres with Jupyter and Pandas
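The heart of this module is the ingestion script (videos 2 and 4 above). Here is a rough sketch of what such a script can look like, assuming a local Postgres container and a Postgres driver such as psycopg2 installed; the connection string, file name, and table name are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for the local Postgres container
engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# Read the CSV in chunks so a large file never has to fit in memory at once
df_iter = pd.read_csv(
    "yellow_tripdata_2021-01.csv.gz",  # placeholder file name
    iterator=True,
    chunksize=100_000,
)

for df in df_iter:
    # Parse the timestamp columns before loading
    df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
    df["tpep_dropoff_datetime"] = pd.to_datetime(df["tpep_dropoff_datetime"])
    # Append each chunk to the target table
    df.to_sql("yellow_taxi_data", con=engine, if_exists="append", index=False)
```

Dockerizing it then mostly means wrapping the script in an image and passing the connection details and file URL as arguments.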
This section covers workflow orchestration with Mage; a sketch of a Mage data loader block follows the video list below.
- 1: What is Orchestration?
- 2: What is Mage?
- 3: Configure Mage
- 4: A Simple Pipeline
- 5: Configuring Postgres
- 6: API to Postgres
- 7: Configuring GCP
- 8: ETL: API to GCS
- 9: ETL: GCS to BigQuery
- 10: Parameterized Execution
- 11: Backfills
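To make the pipeline videos more concrete, here is a minimal data loader block in the style of Mage's generated templates; the URL and function name are placeholders:

```python
import io

import pandas as pd
import requests

# Mage injects these decorators into the block's scope at runtime
if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_taxi_data(*args, **kwargs):
    """Fetch a CSV over HTTP and return it as a DataFrame."""
    url = 'https://example.com/yellow_tripdata_2021-01.csv'  # placeholder URL
    response = requests.get(url)
    response.raise_for_status()
    return pd.read_csv(io.StringIO(response.text))


@test
def test_output(output, *args) -> None:
    assert output is not None, 'The output is undefined'
```

Transformer and exporter blocks follow the same pattern with their own decorators.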
Deployment videos (they say optional, but this is pretty crucial for me):
- Deployment Prerequisites
- Google Cloud Permissions
- Deploying to Google Cloud Part 1
- There seems to be a missing video here. See notes for details on how to deploy using Terraform.
- Deploying to Google Cloud Part 2
- Next Steps
Office hours recording here.
In this section we will talk about data warehousing in general and use Google BigQuery as an example; a sketch of creating a partitioned and clustered table comes at the end of this section.
- 1: Data Warehouse and BigQuery
- 2: Partitioning and Clustering
- 3: BigQuery Best Practices
- 4: Internals of BigQuery
- 5: BigQuery Machine Learning
- 6: BigQuery Machine Learning Deployment
Optional video (but watch this first if, like me, you still don't have the full green and yellow taxi data in GCP or a local Postgres db): Hack for loading data to BigQuery
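As a reminder of what the partitioning and clustering video is driving at, here is a small sketch using the google-cloud-bigquery client; the project, dataset, and table names are placeholders:

```python
from google.cloud import bigquery

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service account key
client = bigquery.Client()

# Placeholder project, dataset, and table names
ddl = """
CREATE OR REPLACE TABLE `my-project.ny_taxi.yellow_partitioned_clustered`
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM `my-project.ny_taxi.yellow_tripdata`;
"""

# Run the DDL and wait for the job to finish
client.query(ddl).result()
```

Partitioning on the pickup date lets BigQuery prune whole partitions at query time, while clustering on VendorID sorts the data within each partition.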
- 1: Analytics Engineering Basics
- 2: What is dbt?
- Start Your dbt Project
- 5: Build the First dbt Models
- 6: Testing and Documenting the Project
- Deployment using
- Visualising the data with
- 1: Introduction to Batch Processing
- 2: Introduction to Spark
- 3: First Look at Spark/PySpark
- 4: Spark Dataframes
- 5: SQL with Spark
- 6: Anatomy of a Spark Cluster
- 7: GroupBy in Spark
- 8: Joins in Spark
Optional:
- Installing Spark (Linux)
- Preparing Yellow and Green Taxi Data
- Resilient Distributed Datasets
- Operations on Spark RDDs
- Spark RDD mapPartition
- Running Spark in the Cloud
- Connecting to Google Cloud Storage
- Creating a Local Spark Cluster
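The GroupBy and Joins videos boil down to patterns like this PySpark sketch; the parquet paths assume the green and yellow taxi data have already been prepared, so treat them as placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local Spark session, as used throughout the module
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("de-zoomcamp")
    .getOrCreate()
)

# Placeholder parquet paths
df_green = spark.read.parquet("data/pq/green/*/*")
df_yellow = spark.read.parquet("data/pq/yellow/*/*")

# GroupBy: revenue per pickup zone and hour for each dataset
green_rev = (
    df_green
    .withColumn("hour", F.date_trunc("hour", "lpep_pickup_datetime"))
    .groupBy("hour", "PULocationID")
    .agg(F.sum("total_amount").alias("green_amount"))
)
yellow_rev = (
    df_yellow
    .withColumn("hour", F.date_trunc("hour", "tpep_pickup_datetime"))
    .groupBy("hour", "PULocationID")
    .agg(F.sum("total_amount").alias("yellow_amount"))
)

# Outer join the two aggregates on the grouping keys
revenue = green_rev.join(yellow_rev, on=["hour", "PULocationID"], how="outer")
revenue.show(5)
```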
The workshop quickly covers how to build data ingestion pipelines using dlt; a minimal pipeline sketch follows the list below. It includes:
- Extracting data from APIs or files
- Normalizing and loading data
- Incremental loading
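A minimal dlt pipeline along those lines might look like the sketch below; the API endpoint is made up, and DuckDB is just a convenient local destination:

```python
import dlt
import requests


@dlt.resource(table_name="rides", write_disposition="append")
def taxi_rides():
    # Placeholder API endpoint; yield records so dlt can normalize them
    url = "https://example.com/api/rides"
    response = requests.get(url)
    response.raise_for_status()
    yield from response.json()


# DuckDB keeps the example local; swap the destination for BigQuery etc.
pipeline = dlt.pipeline(
    pipeline_name="taxi_pipeline",
    destination="duckdb",
    dataset_name="taxi_data",
)

load_info = pipeline.run(taxi_rides())
print(load_info)
```

Incremental loading then mainly comes down to telling dlt which cursor field to track between runs.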