Openlake

Welcome to the Openlake repository! In this repository, we will guide you through the steps to build a Data Lake using open source tools like Spark, Kafka, Trino, Apache Iceberg, and Airflow, deployed on Kubernetes with MinIO as the object store.

What is a Data Lake?

A Data Lake is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale. It enables you to break down data silos and create a single source of truth for all your data, which can then be used for various analytical purposes.

Prerequisites

Before you get started, you will need the following:

  • A Kubernetes cluster: You will need a Kubernetes cluster to deploy the various tools required for building a Data Lake. If you don't have one, you can set one up using tools like kubeadm, kind, or minikube, or use a managed Kubernetes service like Google Kubernetes Engine (GKE) or Amazon Elastic Kubernetes Service (EKS)
  • kubectl: The command-line tool for communicating with a Kubernetes cluster
  • A MinIO instance: You will need a MinIO instance to use as the object store for your Data Lake
  • MinIO Client (mc): You will need mc to run commands that perform actions in MinIO
  • A working knowledge of Kubernetes: You should have a basic understanding of Kubernetes concepts and how to interact with a Kubernetes cluster using kubectl
  • Familiarity with the tools used in this repo: You should have a basic understanding of the tools used in this repo, including Spark, Kafka, Trino, Apache Iceberg, etc.
  • JupyterHub/Notebook (optional): If you are planning to walk through the instructions using notebooks

Table of Contents

Apache Spark

In this section we will cover:

  • Set up Spark on Kubernetes using spark-operator
  • Run Spark jobs with MinIO as object storage
  • Use different types of S3A committers and checkpoints, and see why running Spark jobs on object storage (MinIO) is a better approach than HDFS (a minimal configuration sketch follows this list)
  • Perform CRUD operations on Apache Iceberg tables using Spark
  • Spark Streaming
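
For reference, here is a minimal sketch of a PySpark session pointed at MinIO over S3A with the "magic" committer enabled. The endpoint, credentials, and application name are placeholders, not values from this repo, and the cluster or notebook image needs the hadoop-aws (and matching AWS SDK) jars available.

```python
from pyspark.sql import SparkSession

# Hypothetical MinIO endpoint and credentials -- replace with your deployment's values.
MINIO_ENDPOINT = "http://minio.openlake.svc.cluster.local:9000"
MINIO_ACCESS_KEY = "minioadmin"
MINIO_SECRET_KEY = "minioadmin"

spark = (
    SparkSession.builder
    .appName("openlake-minio-demo")
    # Point the S3A filesystem at MinIO instead of AWS S3.
    .config("spark.hadoop.fs.s3a.endpoint", MINIO_ENDPOINT)
    .config("spark.hadoop.fs.s3a.access.key", MINIO_ACCESS_KEY)
    .config("spark.hadoop.fs.s3a.secret.key", MINIO_SECRET_KEY)
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    # Use the "magic" S3A committer to avoid slow rename-based commits on object storage.
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
    .getOrCreate()
)
```

The committer settings matter because object stores have no atomic rename; the magic committer writes task output directly to its final destination using multipart uploads instead of renaming temporary directories.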

Setup Spark on K8s

To run Spark jobs on Kubernetes we will use spark-operator. You can follow the complete walkthrough here or use the notebook.

Run Spark Jobs with MinIO as Object Storage

Reading data from and writing data to MinIO using Spark is straightforward. You can follow the complete walkthrough here or use the notebook.
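
As a quick illustration, the round trip looks roughly like the sketch below, assuming the MinIO-configured SparkSession from the earlier sketch and a hypothetical bucket named "openlake" that already exists.

```python
# Assumes the MinIO-configured SparkSession `spark` from the previous sketch
# and a hypothetical, pre-created bucket named "openlake".
df = spark.createDataFrame(
    [(1, "alice"), (2, "bob"), (3, "carol")],
    ["id", "name"],
)

# Write the DataFrame to MinIO as Parquet via the s3a:// scheme.
df.write.mode("overwrite").parquet("s3a://openlake/demo/users")

# Read it back and run a simple query.
users = spark.read.parquet("s3a://openlake/demo/users")
users.filter(users.id > 1).show()
```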

Maintain Iceberg Table using Spark

Apache Iceberg is a table format for huge analytic datasets. It supports ACID transactions, scalable metadata handling, and fast snapshot isolation. You can follow the complete walkthrough using the notebook.
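
The sketch below shows what Iceberg CRUD from Spark SQL can look like, assuming the iceberg-spark-runtime package is on the classpath and using hypothetical catalog, namespace, table, and bucket names together with the MinIO S3A settings shown earlier.

```python
from pyspark.sql import SparkSession

# Hypothetical Iceberg catalog backed by a MinIO bucket; the catalog name,
# warehouse path, and table names below are assumptions, not fixed by this repo.
spark = (
    SparkSession.builder
    .appName("openlake-iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.openlake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.openlake.type", "hadoop")
    .config("spark.sql.catalog.openlake.warehouse", "s3a://openlake/warehouse")
    .getOrCreate()
)

# Create a namespace and an Iceberg table, then run simple CRUD statements.
spark.sql("CREATE NAMESPACE IF NOT EXISTS openlake.db")
spark.sql("CREATE TABLE IF NOT EXISTS openlake.db.users (id BIGINT, name STRING) USING iceberg")
spark.sql("INSERT INTO openlake.db.users VALUES (1, 'alice'), (2, 'bob')")
spark.sql("UPDATE openlake.db.users SET name = 'carol' WHERE id = 2")
spark.sql("DELETE FROM openlake.db.users WHERE id = 1")
spark.sql("SELECT * FROM openlake.db.users").show()
```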

Dremio

Dremio is a general-purpose query engine that enables you to query data from multiple sources, including object stores, relational databases, and data lakes. In this section we will cover:

  • Set up Dremio on Kubernetes using Helm
  • Run Dremio queries against datasets and Iceberg tables stored in MinIO

Setup Dremio on K8s

To set up Dremio on Kubernetes we will use Helm. You can follow the complete walkthrough using the notebook.

Access MinIO using Dremio

You can access datasets or Iceberg tables stored in MinIO from Dremio by adding MinIO as a new source. You can follow the complete walkthrough using the notebook.
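
As an illustration (not necessarily how the notebook does it), you can also submit SQL to Dremio programmatically over its REST API; the coordinator address, credentials, and source/table names below are placeholders.

```python
import time
import requests

# Hypothetical Dremio coordinator endpoint and credentials; adjust to your deployment.
DREMIO_URL = "http://dremio-client.dremio.svc.cluster.local:9047"

# Log in to obtain an auth token from Dremio's REST login endpoint.
token = requests.post(
    f"{DREMIO_URL}/apiv2/login",
    json={"userName": "admin", "password": "changeme"},
).json()["token"]
headers = {"Authorization": f"_dremio{token}"}

# Submit a SQL query against a MinIO-backed source (source/table names are examples).
job = requests.post(
    f"{DREMIO_URL}/api/v3/sql",
    headers=headers,
    json={"sql": 'SELECT * FROM minio.openlake."users" LIMIT 10'},
).json()

# Poll the job until it finishes, then fetch results (error handling kept minimal).
job_id = job["id"]
while True:
    status = requests.get(f"{DREMIO_URL}/api/v3/job/{job_id}", headers=headers).json()
    if status["jobState"] in ("COMPLETED", "FAILED", "CANCELED"):
        break
    time.sleep(1)

print(requests.get(f"{DREMIO_URL}/api/v3/job/{job_id}/results", headers=headers).json())
```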

Apache Kafka

Apache Kafka is a distributed streaming platform. It is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, fast, and runs in production in thousands of companies. In this section we will cover how to set up Kafka on Kubernetes and store Kafka topics in MinIO.

Setup Kafka on K8s

To set up Kafka on Kubernetes we will use Strimzi. You can follow the complete walkthrough using the notebook.

Store Kafka Topics in MinIO

You can store Kafka topics in MinIO using sink connectors. You can follow the complete walkthrough using the notebook.
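
For illustration, the sketch below registers an S3-compatible sink connector through the Kafka Connect REST API. It assumes the Confluent S3 sink connector plugin is installed on the Connect workers; the connector class, endpoints, bucket, topic, and region values are placeholders rather than values from this repo, and AWS-style credentials are typically supplied through the Connect worker configuration or environment.

```python
import requests

# Hypothetical Kafka Connect REST endpoint inside the cluster.
CONNECT_URL = "http://kafka-connect.kafka.svc.cluster.local:8083"

# Example configuration for an S3-compatible sink connector pointed at MinIO.
connector = {
    "name": "minio-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "my-topic",
        "s3.bucket.name": "openlake",
        "s3.region": "us-east-1",  # dummy region; MinIO ignores it
        "store.url": "http://minio.openlake.svc.cluster.local:9000",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "100",
    },
}

# Create the connector; Kafka Connect returns the registered configuration.
resp = requests.post(f"{CONNECT_URL}/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```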

Kafka Schema Registry and Iceberg Table (experimental)

You can use Kafka Schema Registry to store and manage schemas for Kafka topics, and you can also use those schemas to create Iceberg tables (experimental). You can follow the complete walkthrough using the notebook.
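
As a small sketch of the Schema Registry side, the example below produces Avro records whose schema is registered with Schema Registry, assuming the confluent-kafka Python client and hypothetical broker and registry addresses; the experimental step of building Iceberg tables from these schemas is covered in the notebook, not here.

```python
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Hypothetical in-cluster addresses for Schema Registry and the Kafka brokers.
schema_registry = SchemaRegistryClient(
    {"url": "http://schema-registry.kafka.svc.cluster.local:8081"}
)

# Avro schema describing the records; registered under the topic's value subject.
schema_str = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
"""
serializer = AvroSerializer(schema_registry, schema_str)

producer = Producer({"bootstrap.servers": "my-kafka-bootstrap.kafka.svc.cluster.local:9092"})

# Serialize a record against the registered schema and send it to the "users" topic.
value = serializer({"id": 1, "name": "alice"},
                   SerializationContext("users", MessageField.VALUE))
producer.produce(topic="users", value=value)
producer.flush()
```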

Kafka Spark Structured Streaming

In this section we will cover how to use Spark Structured Streaming to read data from Kafka topics and write to MinIO.

Spark Structured Streaming

Spark Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can follow the complete walkthrough using the notebook.
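
A minimal sketch of a streaming query, using the built-in rate source and console sink and assuming a running SparkSession named spark:

```python
# The "rate" source emits rows continuously; the console sink prints each micro-batch.
stream = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load()
)

query = (
    stream.writeStream
    .format("console")
    .outputMode("append")
    .start()
)

query.awaitTermination(timeout=30)  # let the query run for ~30 seconds
query.stop()
```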

End-to-End Spark Structured Streaming for Kafka

You can use Spark Structured Streaming to read data from Kafka topics and write to MinIO. You can follow the complete walkthrough using the notebook.
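
A hedged end-to-end sketch is shown below; it assumes the MinIO S3A settings from the earlier sketch, the spark-sql-kafka package on the classpath, and placeholder broker, topic, and bucket names.

```python
# Read a Kafka topic as a stream and continuously write Parquet files to MinIO,
# keeping the checkpoint in MinIO as well. All names below are placeholders.
kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "my-kafka-bootstrap.kafka.svc.cluster.local:9092")
    .option("subscribe", "users")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka keys/values arrive as bytes; cast them to strings before writing.
events = kafka_df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3a://openlake/streaming/users")
    .option("checkpointLocation", "s3a://openlake/checkpoints/users")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```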

Join Community

Openlake is a MinIO project. You can contact the authors on the Slack channel.

License

Openlake is released under the GNU AGPLv3 license. Please refer to the LICENSE document for a complete copy of the license.