Data Engineering Projects

This repository demonstrates how to build data pipelines and systems, providing a better understanding of concepts such as ETL and data lakes and their roles in a data system. The core technologies are Mage and Docker, on top of which we build and integrate other services to broaden our exploration and understanding.

Repository structure

  • Mage: This directory contains all the files and scripts necessary to execute the pipelines. For installation instructions, refer to the official Mage documentation or the first tutorial, which provides a detailed guide on installing Mage.
  • Dockerfile: We use this file to run Mage (a minimal sketch follows this list). Note that it contains a few Spark-specific commands that are not necessary for projects without Spark interactions.
  • Makefile: This is where the commands we use most often live (you can add your own); example targets follow this list.
  • Docker-Compose: This is the file we use to define the services we want to run each time. At the moment it contains all the services I use, but you can adjust it to your needs; an illustrative excerpt follows this list.
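
The Dockerfile below is a minimal sketch, assuming the official mageai/mageai base image; the JDK package and PySpark install are illustrative assumptions, and the Spark lines can be dropped entirely for non-Spark projects.

```dockerfile
# Minimal sketch: run Mage from the official image (assumption: mageai/mageai:latest).
FROM mageai/mageai:latest

# Spark-specific additions -- only needed for projects with Spark interactions.
# The JDK package and unpinned PySpark version are illustrative assumptions.
RUN apt-get update \
    && apt-get install -y --no-install-recommends default-jre-headless \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir pyspark
```

The Makefile is a convenience layer over Docker Compose; the targets below are hypothetical examples and should be renamed to match your own workflow.

```makefile
# Hypothetical convenience targets wrapping Docker Compose.
up:    ## start all services in the background
	docker compose up -d

down:  ## stop and remove the services
	docker compose down

logs:  ## tail the Mage logs (assumes a service named "mage")
	docker compose logs -f mage
```

Finally, an illustrative Docker Compose excerpt showing two of the services; service names, ports, volumes, and credentials are assumptions to adjust to your setup.

```yaml
# Illustrative excerpt, not the full file; names, ports, and credentials are assumptions.
services:
  mage:
    image: mageai/mageai:latest
    ports:
      - "6789:6789"        # Mage web UI
    volumes:
      - .:/home/src        # mount the project into the container
  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"        # S3 API
      - "9001:9001"        # web console
    environment:
      MINIO_ROOT_USER: minio
      MINIO_ROOT_PASSWORD: minio123
```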

To get a full understanding of how to build the repository from scratch, you can check the tutorial here, or you can simply clone the repo and start from there.

Tutorials - Projects

1. Building a Local Data Lake from scratch with MinIO, Iceberg, Spark, StarRocks, Mage, and Docker

In the first tutorial/project, I guide you through building the repository with Mage as the main orchestrator. We leverage various technologies to create a local data lake with Iceberg and query the data using StarRocks.
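
To give a sense of how these pieces connect, here is a hedged PySpark sketch of a session configured against an Iceberg catalog whose warehouse lives in MinIO. The catalog name ("lake"), endpoint, bucket, and credentials are assumptions, and the iceberg-spark-runtime and hadoop-aws jars must be available to Spark; the tutorial covers the actual setup.

```python
# Sketch: a SparkSession wired to an Iceberg catalog backed by MinIO.
# Catalog name, endpoint, bucket, and credentials are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("local-data-lake")
    # Enable Iceberg's SQL extensions.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a Hadoop-type Iceberg catalog named "lake" stored in MinIO.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://warehouse/")
    # Point the S3A filesystem at the local MinIO service.
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Create an Iceberg table in the catalog to verify the wiring.
spark.sql("CREATE TABLE IF NOT EXISTS lake.db.events (id BIGINT, name STRING) USING iceberg")
```

StarRocks can then query the same Iceberg tables through an external catalog, which is what makes this combination work as a local data lake.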

You can find the relevant article with a detailed guide here: Medium blog

The isolated code for that project is here: SparkDataLake