Data Engineering Roadmap
==========================

Welcome to our Data Engineering Roadmap!

This roadmap walks you through data engineering, from the fundamentals to advanced topics. Whether you're a beginner or an experienced professional, it outlines the key concepts, tools, and technologies you need to master.

Data Engineering Fundamentals


  • Introduction to Data Engineering
  • Data Engineering Lifecycle:
    • Data Collection
    • Data Ingestion
    • Data Storage and Management
    • Data Transformation
    • Data Serving
  • Overview of Data Pipelines (ETL/ELT)
  • Overview of Data Modeling
  • Overview of Cloud Data Engineering
  • Soft Skills for Data Engineers
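
To make the lifecycle stages above concrete, here is a minimal sketch of an ETL pipeline using only the Python standard library. The `RAW_CSV` data, the `orders` table, and the function names are hypothetical, chosen for illustration; a real pipeline would read from actual sources and write to a proper warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for a collected source file.
RAW_CSV = """order_id,amount,country
1,19.99,us
2,5.00,de
3,,us
"""


def extract(text):
    """Ingestion: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: drop rows with a missing amount and normalise types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), float(row["amount"]), row["country"].upper())
        )
    return cleaned


def load(rows, conn):
    """Storage: write the cleaned rows into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    """Run extract -> transform -> load, then serve an aggregate query."""
    load(transform(extract(RAW_CSV)), conn)
    return conn.execute(
        "SELECT country, COUNT(*) FROM orders GROUP BY country ORDER BY country"
    ).fetchall()


conn = sqlite3.connect(":memory:")
print(run_pipeline(conn))  # [('DE', 1), ('US', 1)]
```

The third input row has no amount, so the transform step drops it before loading.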

Linux and Git


  • Basic Linux Commands:
    • Introduction to the Command Line
    • Creating and Navigating Directories
    • Listing Files in Directories
    • Creating and Viewing Files
    • Copying and Moving Files
    • Renaming Files
    • Absolute and Relative Paths
    • Viewing and Managing Processes
  • Git and GitHub:
    • Creating a Repo
    • Cloning a Repo
    • Git Add
    • Git Commit
    • Git Push
    • Git Branch
    • Pull Request
    • Resolving Git Conflicts
    • Creating a Git README and Documenting Projects

SQL


  • Introduction to Databases and Data Warehousing
  • Installing PostgreSQL Locally
  • Basic Queries
    • DDL - Data Definition Language
    • DML - Data Manipulation Language
    • DCL - Data Control Language
  • Joins
  • SQL Data Cleaning
  • Window Functions
  • Introduction to Advanced SQL (Subquery & CTE)
  • Creating Tables/Views (Working with Tables)
  • Stored Procedures
  • Entity-Relationship Diagrams (ERDs)
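
As a small illustration of the CTE and window function topics above, the sketch below runs a query from Python. It uses the standard-library `sqlite3` module as a stand-in for Postgres (an assumption made so the example runs without a server; it needs SQLite 3.25 or newer for window functions). The `sales` table and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL and DML: create a hypothetical table and insert sample rows.
conn.executescript(
    """
    CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER);
    INSERT INTO sales VALUES
        ('north', 1, 10), ('north', 2, 20),
        ('south', 1, 5),  ('south', 2, 15);
    """
)

# A CTE aggregates per region and day; a window function then computes
# a running total within each region.
QUERY = """
WITH daily AS (
    SELECT region, day, SUM(amount) AS total
    FROM sales
    GROUP BY region, day
)
SELECT region, day, total,
       SUM(total) OVER (PARTITION BY region ORDER BY day) AS running_total
FROM daily
ORDER BY region, day;
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The same query text should also run on Postgres, since it uses only standard CTE and window syntax.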

Python


  • Python Basics
    • Control Flow
    • Operators:
      • Arithmetic Operators
      • Assignment Operators
      • Comparison Operators
      • Logical Operators
      • Identity Operators
      • Membership Operators
    • Conditional Statements:
      • If and Else Statements
    • Loops:
      • For Loop
      • While Loop
    • Functions:
      • Normal Functions
      • Generic Functions
      • Non-Default Arguments
      • Default Arguments
      • *args and **kwargs
    • Modules & Packages:
      • Built-in Modules
      • Custom Modules
      • Packages
    • Errors and Exceptions
  • Data Structures
  • File Handling
  • Data Manipulation with Pandas
  • Database Interaction
  • API and Web Scraping
  • ETL Process and Data Pipeline
  • Version Control for Projects
  • Introduction to OOP
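
The function and exception topics above can be sketched in a few lines. The names `build_job` and `safe_ratio` are hypothetical and exist only for this illustration.

```python
def build_job(name, *steps, retries=3, **options):
    """Show the argument kinds side by side.

    name     - required (non-default) argument
    *steps   - any extra positional arguments, collected into a tuple
    retries  - keyword argument with a default value
    **options - any extra keyword arguments, collected into a dict
    """
    return {
        "name": name,
        "steps": list(steps),
        "retries": retries,
        "options": options,
    }


def safe_ratio(numerator, denominator):
    """Handle an expected error instead of letting it crash the caller."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        return None


job = build_job("daily_load", "extract", "load", owner="data-team")
print(job["steps"], job["retries"], job["options"])
print(safe_ratio(10, 4), safe_ratio(1, 0))
```

Placing `retries` after `*steps` makes it keyword-only, so extra positional values cannot overwrite it by accident.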

Data Modeling


  • Fundamental Concepts
  • Basic Techniques for Dimension Tables
  • Basic Techniques for Fact Tables
  • Slowly Changing Dimensions

dbt


  • dbt Fundamentals
  • Understanding Jinja, Macros, and Testing in dbt
  • dbt Packages
  • Introduction to dbt Cloud

Docker


  • Overview of Docker and Its Internals
  • Dockerfile
  • Docker Images
  • Docker Containers
  • Understanding Docker Volumes
  • Docker Networking
  • Introduction to YAML
  • Docker Compose and Anchors in Docker Compose

CI/CD


  • GitHub Actions

Data Integration


  • Airbyte:
    • Airbyte Concepts:
      • Source
      • Destination
      • Connection
      • Connector
      • Sync
    • Airbyte Architecture:
      • Architecture Overview
      • WebApp
      • API Server
      • Metadata Database
      • Temporal
      • Worker
    • Running Airbyte in Docker
    • Understanding Source Configuration
    • Understanding Destination Configuration
    • Configuring a Full Synchronization between Source and Destination:
      • S3 to Postgres Database
      • Postgres Database to Redshift
    • How Sync Works Under the Hood

Orchestration


  • Introduction to Apache Airflow
  • Airflow Concepts:
    • Workflow
    • DAG
    • Task
    • Operators
    • Dependencies
  • Installation and Setup:
    • Prerequisites
    • Installation
    • Configuration
  • Airflow Architecture:
    • Architecture Overview
    • WebServer
    • Metadata Database
    • Scheduler
    • Worker
    • How Airflow Works
  • Creating Your First DAG
  • Understanding DAG Configuration
  • Understanding Task Configuration
  • Understanding Airflow Variables
  • Advanced DAG Concepts
  • Monitoring and Debugging
  • Airflow Configuration and Best Practices
  • Projects
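
The DAG, task, and dependency concepts above can be illustrated without installing Airflow. The sketch below is not Airflow code; it uses the standard-library `graphlib` module (Python 3.9+) to show how a scheduler can derive a valid run order from task dependencies. The task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def run_order(graph):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


def is_valid_dag(graph):
    """A workflow is a valid DAG only if its dependencies contain no cycle."""
    try:
        run_order(graph)
        return True
    except CycleError:
        return False


order = run_order(dependencies)
print(order)
print(is_valid_dag({"a": {"b"}, "b": {"a"}}))  # False: a and b wait on each other
```

In Airflow itself, the same structure is declared with operators and dependency arrows inside a DAG definition, and the scheduler works out the order.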

Cloud


  • Introduction to the Cloud
  • IAM
  • Data Lake
  • Python Libraries to Interact with the Cloud
  • Data Catalog
  • Relational Database Services (RDS)
  • Data Warehouse
  • ETL Services
  • Orchestration Services
  • Compute

Spark


  • Introduction to Spark
  • Installation
  • Spark SQL and DataFrame API
  • RDDs
  • Transformations and Actions
  • Spark Streaming
  • Structured Streaming
  • Tuning and Optimization

Terraform


  • What Terraform Is
  • How Terraform Works
  • Terraform State File
  • Remote State File
  • Basic Provisioning
    • Resource Referencing
    • Data Source
    • Local Usage
  • Modules
    • Module Overview
    • Module Structure
    • Creating a Simple Resource Module

Kafka


  • Apache Kafka Overview
  • Kafka Architecture
  • Kafka Topic, Partition, and Offset
  • Producer and Consumer
  • Consumer Group and Rebalancing Protocol
  • Kafka Connect

Optional


  • Introduction to Kubernetes