Data Engineering Tech Stack: Learn Modern Data Engineering Tools

This repository provides a comprehensive environment to learn and practice modern data engineering concepts using Docker Compose. With this setup, you'll gain hands-on experience working with some of the most popular tools in the data engineering ecosystem. Whether you’re a beginner or looking to expand your skills, this tech stack offers everything you need to work with ETL/ELT pipelines, data lakes, querying, and visualization.

What You Will Learn

  • Apache Spark: Learn distributed data processing with one of the most powerful engines for large-scale data transformations.
  • Apache Iceberg: Understand how to work with Iceberg tables for managing large datasets in data lakes with schema evolution and time travel.
  • Project Nessie: Explore version control for your data lakehouse, allowing you to track changes to datasets just like Git for code.
  • Apache Airflow: Master workflow orchestration and scheduling for complex ETL/ELT pipelines.
  • Trino: Query your data from multiple sources (MinIO, PostgreSQL, Iceberg) with a fast federated SQL engine.
  • Apache Superset: Create interactive dashboards and visualizations to analyze and present your data.
  • MinIO: Learn about object storage and how it integrates with modern data pipelines, serving as your S3-compatible storage layer. (This architecture powers a NYC Taxi Data Project: https://lnkd.in/gWMj6RS7, which is why the bucket name "nyc-project" appears throughout the examples.)
  • PostgreSQL: Use a relational database for metadata management and storing structured data.


Services

MinIO (Object Storage)

  • Purpose: Acts as an S3-compatible object storage layer for raw and processed data.
  • What You’ll Learn:
    • Uploading and managing files via the MinIO Console.
    • Using MinIO as a source and destination for ETL pipelines.
  • Console URL: http://localhost:9001
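
As a first taste of the object-storage workflow, here is a minimal Python sketch that uploads a file to the "nyc-project" bucket. The S3 API port (9000, as opposed to 9001 for the console) and the credentials are assumptions; take the real values from .env/minio.env.

      from minio import Minio

      # Credentials and the S3 API port are assumptions -- match .env/minio.env.
      client = Minio(
          "localhost:9000",
          access_key="minioadmin",
          secret_key="minioadmin",
          secure=False,
      )

      # Create the project bucket if it does not exist yet.
      if not client.bucket_exists("nyc-project"):
          client.make_bucket("nyc-project")

      # Upload a local file into the bucket as an object.
      client.fput_object("nyc-project", "raw/yellow_tripdata.parquet",
                         "yellow_tripdata.parquet")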

Airflow

  • Purpose: Workflow orchestration and ETL pipeline management.
  • What You’ll Learn:
    • Building and scheduling DAGs (Directed Acyclic Graphs) to automate workflows.
    • Managing and monitoring pipelines through the Airflow web interface.
  • Web Interface URL: http://localhost:8081
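
To show the shape of a DAG before you write your own, here is a minimal sketch in the style you would drop into the dags/ folder. The DAG id, schedule, and task bodies are placeholders, not pipelines shipped with this repository.

      from datetime import datetime

      from airflow import DAG
      from airflow.operators.python import PythonOperator

      def extract():
          print("pull raw data into MinIO")

      def transform():
          print("run the Spark transformation")

      # A placeholder pipeline: two Python tasks, extract running before transform.
      with DAG(
          dag_id="nyc_taxi_pipeline",
          start_date=datetime(2024, 1, 1),
          schedule_interval="@daily",
          catchup=False,
      ) as dag:
          extract_task = PythonOperator(task_id="extract", python_callable=extract)
          transform_task = PythonOperator(task_id="transform", python_callable=transform)
          extract_task >> transform_task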

Extraction and Transformation

  • Purpose: Run Spark and custom Python scripts for data ingestion and transformation.
  • What You’ll Learn:
    • Writing Spark jobs to process large-scale datasets.
    • Using Python scripts to ingest and transform data into MinIO or PostgreSQL.
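
The sketch below shows the shape such a Spark job can take: read raw data from MinIO over the s3a connector, apply a transformation, and write the result back. The endpoint, credentials, and the trip_distance column are assumptions tied to the NYC taxi example; match them to your .env files.

      from pyspark.sql import SparkSession

      # A minimal sketch -- the MinIO endpoint and credentials are assumptions;
      # "minio" is the Docker Compose service hostname, 9000 the S3 API port.
      spark = (
          SparkSession.builder
          .appName("nyc-taxi-transform")
          .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
          .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
          .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
          .config("spark.hadoop.fs.s3a.path.style.access", "true")
          .getOrCreate()
      )

      # Read raw data from the bucket, drop bad rows, write the processed copy back.
      df = spark.read.parquet("s3a://nyc-project/raw/yellow_tripdata.parquet")
      cleaned = df.dropna().filter(df.trip_distance > 0)  # trip_distance: NYC taxi column
      cleaned.write.mode("overwrite").parquet("s3a://nyc-project/processed/yellow_tripdata/")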

PostgreSQL

  • Purpose: A relational database for storing metadata, structured data, and managing transactions.
  • What You’ll Learn:
    • Querying relational datasets using SQL.
    • Storing and retrieving structured data for analysis or visualization.
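
Querying from Python looks like this with psycopg2; the host, database, and credentials below are assumptions, so take the real values from .env/postgres.env.

      import psycopg2

      # Connection settings are assumptions -- use the values from .env/postgres.env.
      conn = psycopg2.connect(
          host="localhost",
          port=5432,
          dbname="postgres",
          user="postgres",
          password="postgres",
      )
      with conn, conn.cursor() as cur:
          # Any SQL works here; counting visible tables is a harmless smoke test.
          cur.execute("SELECT count(*) FROM information_schema.tables;")
          print(cur.fetchone()[0])
      conn.close()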

Project Nessie

  • Purpose: Version control for data lakes and Iceberg tables.
  • What You’ll Learn:
    • Creating branches and commits for data changes.
    • Rolling back or time-traveling to previous versions of datasets.
  • API URL: http://localhost:19120
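
As a first contact with the API, this sketch lists the branches and tags on the local server. It assumes the v1 REST endpoint and response shape, which can vary across Nessie versions.

      import requests

      # List references (branches and tags) from the local Nessie server.
      # Assumes the v1 REST API; the response layout may differ between versions.
      resp = requests.get("http://localhost:19120/api/v1/trees")
      resp.raise_for_status()
      for ref in resp.json()["references"]:
          print(ref.get("type"), ref.get("name"), ref.get("hash"))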

Trino

  • Purpose: A federated query engine for SQL-based exploration of multiple data sources.
  • What You’ll Learn:
    • Querying data stored in MinIO, PostgreSQL, and Iceberg tables.
    • Using SQL to join data from different sources.
  • Web Interface URL: http://localhost:8080
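
The trino Python client gives a quick way to exercise the federated layer; the user name below is arbitrary, on the assumption that the local setup is unauthenticated (verify against your Trino config).

      import trino

      # Connect to the local coordinator; "admin" is an arbitrary user name.
      conn = trino.dbapi.connect(host="localhost", port=8080, user="admin")
      cur = conn.cursor()

      # List the configured catalogs (e.g., iceberg, postgresql, ...).
      cur.execute("SHOW CATALOGS")
      for (catalog,) in cur.fetchall():
          print(catalog)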

Superset

  • Purpose: Visualization and dashboarding tool for creating insights from data.
  • What You’ll Learn:
    • Building interactive dashboards connected to Trino and PostgreSQL.
    • Analyzing data through visualizations.
  • Dashboard URL: http://localhost:8088
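
In Superset, these connections are registered as SQLAlchemy URIs. A Trino connection typically looks like the line below; the user, host name, and catalog are assumptions that must match your Docker network and Trino catalogs:

      trino://admin@trino:8080/iceberg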

How to Run

  1. Clone the Repository

      git clone https://github.com/username/dataengineering-tech-stack.git
      cd dataengineering-tech-stack

  2. Set Environment Variables
    • Fill in your configurations in .env/airflow.env, .env/minio.env, .env/postgres.env, etc.

  3. Build & Start Services

      docker-compose up -d --build

  4. Access Services
    • MinIO Console: http://localhost:9001
    • Airflow: http://localhost:8081
    • Nessie API: http://localhost:19120
    • Trino: http://localhost:8080
    • Superset: http://localhost:8088


Learning Objectives for Each Tool

Spark

  • Understand how distributed computing works: Learn how Apache Spark processes large datasets in parallel across multiple nodes, making it an essential tool for handling big data efficiently.
  • Write transformations for ETL pipelines on large-scale datasets: Build and execute transformations that extract, clean, and load data into storage or analytics layers.

Iceberg

  • Work with Iceberg tables for schema evolution and time travel: Gain experience managing dataset changes over time without breaking downstream dependencies.
  • Manage partitions for optimized querying: Use Iceberg's built-in partitioning to improve query performance on large datasets.
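
To make the two ideas concrete, the sketch below adds a column and then queries an older state of a table through Spark SQL. The catalog wiring (a Nessie-backed Iceberg catalog named "nessie"), warehouse path, and table name are assumptions, and the matching Iceberg and Nessie runtime jars must be on the Spark classpath.

      from pyspark.sql import SparkSession

      # A sketch only: catalog name, warehouse path, and table name are assumptions.
      spark = (
          SparkSession.builder
          .appName("iceberg-demo")
          .config("spark.sql.extensions",
                  "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
          .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
          .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
          .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")
          .config("spark.sql.catalog.nessie.ref", "main")
          .config("spark.sql.catalog.nessie.warehouse", "s3a://nyc-project/warehouse")
          .getOrCreate()
      )

      # Schema evolution: add a column without rewriting existing data files.
      spark.sql("ALTER TABLE nessie.nyc.taxi_trips ADD COLUMNS (tip_rate double)")

      # Time travel (Spark 3.3+ syntax): query the table as it existed earlier.
      spark.sql(
          "SELECT * FROM nessie.nyc.taxi_trips TIMESTAMP AS OF '2024-01-01 00:00:00'"
      ).show()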

Project Nessie

  • Learn to implement Git-like workflows for datasets: Track changes, create branches, and roll back datasets to maintain consistency in your data pipelines.
  • Manage branches, tags, and commits for Iceberg tables: Use Nessie to version-control datasets and simplify collaboration across teams.

Airflow

  • Build workflows that orchestrate Spark, MinIO, and PostgreSQL: Automate complex data workflows involving multiple tools and dependencies.
  • Monitor and troubleshoot DAG executions: Learn to manage and debug Directed Acyclic Graphs (DAGs) for efficient task scheduling.

Trino

  • Query heterogeneous data sources (e.g., MinIO, PostgreSQL) in a unified SQL layer: Use Trino to query structured and semi-structured data seamlessly across multiple backends.
  • Analyze Iceberg tables and large datasets efficiently: Leverage Trino's SQL engine to perform ad hoc analysis or integrate with BI tools.

Superset

  • Build visualizations and interactive dashboards: Create engaging charts and dashboards to derive insights from your data.
  • Connect Superset to Trino and PostgreSQL for real-time insights: Visualize data in near real-time to support decision-making.

MinIO

  • Store and retrieve files in an S3-compatible system: Manage object storage for raw and processed data in a local or cloud environment.
  • Integrate MinIO with Spark and Airflow pipelines: Use MinIO as a central hub for ingesting, transforming, and exporting data in your workflows.