This repository documents my journey through the diverse and exciting field of data engineering. It serves as a curated collection of resources, notes, projects, and practical examples that have guided me in building a solid foundation in data engineering concepts, tools, and best practices.
Syllabus
- Course overview
- Introduction to GCP
- Docker and docker-compose
- Running Postgres locally with Docker
- Setting up infrastructure on GCP with Terraform
- Preparing the environment for the course
- Homework
- Data Lake
- Workflow orchestration
- Introduction to Prefect
- ETL with GCP & Prefect
- Parametrizing workflows
- Prefect Cloud and additional resources
- Homework
- Data Warehouse
- BigQuery
- Partitioning and clustering
- BigQuery best practices
- Internals of BigQuery
- Integrating BigQuery with Airflow
- BigQuery Machine Learning
- Basics of analytics engineering
- dbt (data build tool)
- BigQuery and dbt
- Postgres and dbt
- dbt models
- Testing and documenting
- Deployment to the cloud and locally
- Visualizing the data with Google Data Studio and Metabase
- Batch processing
- What is Spark
- Spark Dataframes
- Spark SQL
- Internals: GroupBy and joins
- Streaming
- Introduction to Kafka
- Schemas (Avro)
- Kafka Streams
- Kafka Connect and KSQL
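Many of the syllabus items above are hands-on. As one illustration of the "Parametrizing workflows" topic, here is a minimal, library-free Python sketch; the URL pattern and function names are hypothetical (not from the course materials), and no orchestrator such as Prefect is used so the example stays dependency-free:

```python
# Hypothetical sketch of a parametrized ETL flow: the same steps run
# for any (year, month) pair, and a backfill just maps over parameters.

def build_source_url(year: int, month: int) -> str:
    """Build the (made-up) URL for one monthly data file."""
    return f"https://example.com/data/trips_{year:04d}-{month:02d}.csv.gz"

def etl_one_month(year: int, month: int) -> str:
    """Run the flow for a single (year, month) parameter pair."""
    url = build_source_url(year, month)
    # In a real flow you would download `url`, clean the data,
    # and load it into the data lake / warehouse here.
    return url

def backfill(year: int, months: list[int]) -> list[str]:
    """Parametrized backfill: re-run the same flow over many months."""
    return [etl_one_month(year, m) for m in months]

urls = backfill(2021, [1, 2, 3])
print(urls[0])  # https://example.com/data/trips_2021-01.csv.gz
```

An orchestrator like Prefect adds scheduling, retries, and observability on top of this idea, but the parametrization pattern itself is the same.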
Putting everything we learned into practice
- Weeks 7 and 8: working on your project
- Week 9: reviewing your peers' projects
Technologies
- Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
- Google Cloud Storage (GCS): Data Lake
- BigQuery: Data Warehouse
- Terraform: Infrastructure-as-Code (IaC)
- Docker: Containerization
- SQL: Data Analysis & Exploration
- Prefect: Workflow Orchestration
- dbt: Data Transformation
- Spark: Distributed Processing
- Kafka: Streaming