/iceberg-demo

A sample implementation of streaming writes to an Iceberg table on GCS using Flink, and of reading the table back using Trino

Primary language: Java | License: MIT

Introduction

Apache Iceberg is a table format specification for high-volume, fast analytical tables. A wide range of tools can interact with it, on both the read and the write side. It improves on the first-generation Apache Hive table format and sits alongside the contemporary Apache Hudi and Delta Lake formats.

These are some of the features of Apache Iceberg table format:

  • ACID transactions
  • Hidden partitions and partition evolution
  • Schema evolution
  • Time travel (illustrated in the sketch below)
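As a small illustration of time travel: recent Trino releases can query an Iceberg table as of an earlier snapshot. The snippet below is only a sketch using the Trino JDBC driver; the connection URL, table name, and snapshot id are placeholders, not values from this repo.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TimeTravelSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details: Trino coordinator, "iceberg" catalog, "demo" schema.
        String url = "jdbc:trino://localhost:8080/iceberg/demo";
        try (Connection conn = DriverManager.getConnection(url, "trino", null);
             Statement stmt = conn.createStatement();
             // FOR VERSION AS OF reads the table at an older snapshot id (placeholder value below).
             ResultSet rs = stmt.executeQuery(
                 "SELECT count(*) FROM events FOR VERSION AS OF 1234567890123456789")) {
            while (rs.next()) {
                System.out.println("row count at that snapshot: " + rs.getLong(1));
            }
        }
    }
}
```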

The article Building an Enterprise-Level Real-Time Data Lake Based on Flink and Iceberg, on the Alibaba Engineering blog, discusses in detail the architectural patterns that can be used to build a functioning real-time data lake.

Real-time Data Lake

In this bare-bones example, data from Kafka is made available in real time in an Iceberg table, queryable by Trino, via Flink stream processing.

Architecture: Kafka → Flink → Iceberg (on GCS) → Trino

These are the layers used in this demo:

  1. Object Store / File System: Google Cloud Storage.
  2. Streaming Source: Kafka. In a real deployment the data would be written to Kafka by some other origin; in this demo it is produced by a long-running for loop (see SampleGenerator).
  3. Stream processing system: Apache Flink.
  4. Streaming Targets: two sinks are used, Kafka and an Iceberg table. Writes to the Iceberg table go through the Hadoop GCS connector (the shaded jar is preferred to avoid dependency-version conflicts); a pipeline sketch follows this list.
  5. Query Engine: Trino (via the Iceberg connector on GCS).
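The Flink job wired up by Driver follows this general shape. The sketch below is illustrative only: the bootstrap server, topic, checkpoint interval, table location, and single-column schema are assumptions rather than the repo's actual configuration (which lives in Driver.Configurations), it omits the secondary Kafka sink, and it loads the table from a Hadoop-style location while the repo may load it through a catalog instead.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class PipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Iceberg commits data files on checkpoints, so checkpointing must be enabled.
        env.enableCheckpointing(60_000L);

        // Source: a Kafka topic carrying plain string payloads (names are placeholders).
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("events")
                .setGroupId("iceberg-demo")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Map each message into a single-column RowData matching the (assumed) table schema.
        DataStream<RowData> rows = env
                .fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                .map(value -> (RowData) GenericRowData.of(StringData.fromString(value)))
                .returns(TypeInformation.of(RowData.class));

        // Sink: the Iceberg table on GCS, reached via the Hadoop GCS connector on the classpath.
        TableLoader tableLoader = TableLoader.fromHadoopTable("gs://my-bucket/warehouse/demo/events");
        FlinkSink.forRowData(rows)
                .tableLoader(tableLoader)
                .append();

        env.execute("kafka-to-iceberg");
    }
}
```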

Prerequisites

  1. Set up Kafka - For local development, Kafka can be run in a Docker container.
  2. Set up Trino with the necessary connectors - Clone the tj---/trino-hive repository and follow these steps to launch the necessary containers.
  3. Create an Iceberg table via Trino: Follow these steps to create an Iceberg table (a sketch of such a CREATE TABLE statement follows this list).
  4. Note all of the parameters from the steps above and update the constants in the Driver.Configurations class.
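For step 3, the Iceberg table can be created through Trino, for example over its JDBC driver (trino-jdbc). The sketch below is illustrative only: the catalog, schema, table, and column names are assumptions that must be aligned with the trino-hive setup and the constants in Driver.Configurations. The partitioning property also shows the hidden partitioning feature mentioned above: queries never reference the derived day(event_ts) partition directly.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateTableSketch {
    public static void main(String[] args) throws Exception {
        // "iceberg" is assumed to be the Trino catalog backed by the Iceberg connector on GCS.
        String url = "jdbc:trino://localhost:8080/iceberg/demo";
        try (Connection conn = DriverManager.getConnection(url, "trino", null);
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS events (" +
                "  id BIGINT," +
                "  payload VARCHAR," +
                "  event_ts TIMESTAMP(6)" +
                ") WITH (" +
                "  format = 'PARQUET'," +
                // Hidden partitioning: the partition is derived from event_ts, not stored as a column.
                "  partitioning = ARRAY['day(event_ts)']" +
                ")");
        }
    }
}
```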

Running the Demo

  1. Run Driver::Main to set up and start the Flink pipeline. Make sure there are no errors; once initialization completes, the pipeline is ready to process data.
  2. Run SampleGenerator::Main to ingest records into the Kafka source topics (a producer sketch follows this list). Within a short time, governed by the configured Flink checkpoint interval, the data becomes available in the table.
  3. Run TrinoIcebergReader::Main to read data from the sample table via the Trino engine.
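For reference, a SampleGenerator-style producer might look like the sketch below. The topic name, record shape, and throttling are assumptions, not the repo's actual generator; it simply keeps writing records to the Kafka source topic in a long-running loop.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SampleGeneratorSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Long-running loop: each iteration publishes one synthetic JSON record.
            for (long i = 0; ; i++) {
                String payload = "{\"id\": " + i + ", \"payload\": \"sample-" + i + "\"}";
                producer.send(new ProducerRecord<>("events", Long.toString(i), payload));
                Thread.sleep(500); // throttle so the stream is easy to follow
            }
        }
    }
}
```

Once a Flink checkpoint completes and the Iceberg snapshot is committed, the new rows become visible to Trino, and TrinoIcebergReader (or any Trino client) can query the table.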