Data Engineering Zoomcamp

This is my collection of notes and code from following the DataTalks Data Engineering Zoomcamp.

Deadlines

πŸ—“οΈ Project's timeline

| Module | Start Date | Homework Due | Weeks to complete | Videos | Duration | Notes |
|---|---|---|---|---|---|---|
| 1. Introduction & Prerequisites | 15 Jan | 25 Jan | 2 | x9 | 2h 50m | πŸ“ |
| 2. Workflow Orchestration | 29 Jan | 05 Feb | 1 | x11 | 1h 32m | πŸ“ |
| 3. Data Warehouse | 05 Feb | 12 Feb | 1 | x6 | 1h 01m | πŸ“ |
| dlt workshop | 05 Feb | 15 Feb | 1.5 | x1 | 1h 20m | πŸ“ |
| 4. Analytics Engineering | 15 Feb | 22 Feb | 1 | x10 | 2h 41m | πŸ“ |
| 5. Batch processing | 22 Feb | 04 Mar | 1.5 | | | πŸ“ |
| 6. Streaming | 04 Mar | 15 Mar | 1.5 | | | πŸ“ |
| RisingWave workshop | 04 Mar | 18 Mar | n/a | | | πŸ“ |
| Project (attempt 1) | 18 Mar | 01 Apr | 2 | | | πŸ“ |
| Project evaluation (attempt 1) | 01 Apr | 08 Apr | 1 | | | πŸ“ |
| Project (attempt 2) | 01 Apr | 15 Apr | 2 | | | πŸ“ |
| Project evaluation (attempt 2) | 15 Apr | 29 Apr | 1 | | | πŸ“ |

Prep

Here is a checklist of what you need:

  • Set up a virtual environment for Python development
  • Install Docker Desktop
  • Get a Google Cloud account
  • Install Terraform (you can follow the docs or, like me, install it in a conda environment)
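Once Terraform is installed, a minimal configuration for module 1's cloud setup looks something like the sketch below. The project ID, bucket name, and region are placeholders, not the course's actual values:

```hcl
# Minimal sketch: a GCS bucket managed by Terraform.
# Project ID, bucket name, and region are placeholders.
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

provider "google" {
  project = "my-project-id"   # placeholder
  region  = "europe-west1"    # placeholder
}

resource "google_storage_bucket" "data_lake" {
  name          = "my-project-id-data-lake"  # placeholder; bucket names are globally unique
  location      = "EU"
  force_destroy = true
}
```

With this in `main.tf`, the usual `terraform init` / `terraform plan` / `terraform apply` cycle creates the bucket.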

Create a Python virtual environment

I use mamba to manage my virtual environments; see env.yaml for the requirements (this will be updated as I move through the course).
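For reference, such an env.yaml might look roughly like this. The package list below is a guess at the early-course requirements, not the repo's actual file:

```yaml
# Sketch of a mamba/conda environment file -- an illustration, not the
# actual env.yaml in this repo.
name: de-zoomcamp
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pandas
  - sqlalchemy
  - psycopg2
  - jupyter
```

Create it with `mamba env create -f env.yaml` and activate with `mamba activate de-zoomcamp`.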

Install Docker Desktop

Setting up Docker with Windows 11 and WSL is very easy. Assuming WSL is already installed, install Docker Desktop on Windows. To enable the docker CLI on your distro of choice within WSL, just adjust the settings in Docker Desktop:

  • Settings > Resources > WSL integration
  • Select the distros in which you want to enable docker commands.

Modules

1. Introduction and Prerequisites

This section will cover Docker, running postgres and pgAdmin containers, some SQL basics, and setting up cloud resources in Google Cloud using Terraform.
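The course runs these queries against Postgres in Docker; as a self-contained stand-in, the same kind of SQL can be tried with Python's built-in sqlite3 module. The table and column names here are illustrative, not the course's taxi schema:

```python
import sqlite3

# In the course these queries run against Postgres (via pgAdmin or psql);
# sqlite3 is used here only so the example runs without a container.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (id INTEGER PRIMARY KEY, pickup_zone TEXT, fare REAL)"
)
conn.executemany(
    "INSERT INTO trips (pickup_zone, fare) VALUES (?, ?)",
    [("Astoria", 12.5), ("Astoria", 7.0), ("Harlem", 20.0)],
)

# A basic aggregation: total fare per pickup zone
rows = conn.execute(
    "SELECT pickup_zone, SUM(fare) FROM trips GROUP BY pickup_zone ORDER BY pickup_zone"
).fetchall()
print(rows)  # [('Astoria', 19.5), ('Harlem', 20.0)]
```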

πŸ“š Resources

πŸ“Ί Videos

Bonus videos:

2. Workflow Orchestration

This section covers workflow orchestration with Mage.
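Mage structures a pipeline as a chain of decorated loader, transformer, and exporter blocks. The plain-Python sketch below only mimics that shape to make it concrete; the function names are illustrative, not Mage's API:

```python
# A plain-Python sketch of the loader -> transformer -> exporter shape
# that Mage pipelines use. Mage wires decorated blocks together itself;
# this is just the data flow, not Mage's API.

def load_data():
    # stands in for pulling raw records from an API or file
    return [{"fare": "12.5"}, {"fare": "7.0"}, {"fare": "-1.0"}]

def transform(rows):
    # cast types and drop bad records
    out = [{"fare": float(r["fare"])} for r in rows]
    return [r for r in out if r["fare"] >= 0]

def export_data(rows, sink):
    # in the course this step would write to Postgres or GCS
    sink.extend(rows)

warehouse = []
export_data(transform(load_data()), warehouse)
print(warehouse)  # [{'fare': 12.5}, {'fare': 7.0}]
```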

πŸ“š Resources

πŸ“Ί Videos

Deployment videos (they say optional, but this is pretty crucial for me):

Office hours recording here.

3. Data Warehouse

In this section we will talk about data warehousing in general and use Google BigQuery as an example.
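A typical pattern from this module is creating a partitioned (and optionally clustered) table in BigQuery. The dataset and table names below are placeholders; the partition and cluster columns follow the yellow-taxi schema used in the course:

```sql
-- Placeholder dataset/table names; adjust to your own project.
CREATE OR REPLACE TABLE my_dataset.yellow_trips_partitioned
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM my_dataset.yellow_trips_raw;
```

Partitioning by pickup date lets BigQuery scan only the partitions a query touches, which cuts cost on date-filtered queries.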

πŸ“š Resources

πŸ“Ί Videos

4. Analytics Engineering

πŸ“š Resources

πŸ“Ί Videos

Optional video (but watch this first if, like me, you still don't have the full green and yellow taxi data in GCP or a local postgres db): Hack for loading data to BigQuery
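When backfilling the taxi data, the first step is enumerating the monthly files. A small helper like the one below builds those URLs; the URL pattern is an assumption based on the public TLC cloudfront host, so verify it before bulk-downloading:

```python
# Sketch: build the list of monthly NYC TLC parquet URLs for one year.
# The URL pattern is an assumption -- check it against the TLC trip
# record page before relying on it.
BASE = "https://d37ci6vzurychx.cloudfront.net/trip-data"

def monthly_urls(service: str, year: int) -> list[str]:
    return [
        f"{BASE}/{service}_tripdata_{year}-{month:02d}.parquet"
        for month in range(1, 13)
    ]

urls = monthly_urls("yellow", 2021)
print(urls[0])  # first month's file for yellow taxi, 2021
```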

5. Batch Processing

πŸ“š Resources

πŸ“Ί Videos

  • 1: Introduction to Batch Processing
  • 2: Introduction to Spark
  • 3: First Look at Spark/PySpark
  • 4: Spark Dataframes
  • 5: SQL with Spark
  • 6: Anatomy of a Spark Cluster
  • 7: GroupBy in Spark
  • 8: Joins in Spark
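To make the GroupBy video concrete without a Spark cluster: the snippet below shows, in plain Python, what a grouped aggregation computes. In PySpark this would be `df.groupBy("zone").agg(F.sum("fare"))`; the stand-in is only an illustration of the semantics:

```python
# What Spark's groupBy + sum computes, shown with plain Python
# (not PySpark code -- just the semantics, without a cluster).
from collections import defaultdict

rows = [("Astoria", 12.5), ("Astoria", 7.0), ("Harlem", 20.0)]

totals = defaultdict(float)
for zone, fare in rows:   # each Spark task does this over its own partition
    totals[zone] += fare  # partial sums are then shuffled and merged by key

print(dict(totals))  # {'Astoria': 19.5, 'Harlem': 20.0}
```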

Optional:

Workshops

dlt

The workshop quickly covers how to build data ingestion pipelines using dlt. It includes:

  • Extracting data from APIs or files.
  • Normalizing and loading data.
  • Incremental loading.
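Incremental loading boils down to remembering a cursor (e.g. the highest id already loaded) and only appending records past it. dlt automates exactly this bookkeeping (via `dlt.sources.incremental`); the plain-Python sketch below is an illustration of the idea, not dlt's API:

```python
# Sketch of incremental loading: track the max id already loaded and
# append only newer records. dlt does this bookkeeping for you; the
# function here is illustrative, not dlt's API.

def load_incrementally(source_rows, destination, state):
    cursor = state.get("last_id", 0)
    new_rows = [r for r in source_rows if r["id"] > cursor]
    destination.extend(new_rows)
    if new_rows:
        state["last_id"] = max(r["id"] for r in new_rows)
    return len(new_rows)

dest, state = [], {}
load_incrementally([{"id": 1}, {"id": 2}], dest, state)                  # first run loads both
n = load_incrementally([{"id": 1}, {"id": 2}, {"id": 3}], dest, state)   # only id 3 is new
print(n, state)  # 1 {'last_id': 3}
```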

πŸ“š Resources

πŸ“Ί Video