Uncorking Analytics with Apache Kafka®, Apache Flink, and Apache Pinot™

Table of Contents

Abstract
Learning Objectives
Prerequisites
Training Outline and Timeline
Let’s Get Going!
Practice Part Overview

Abstract

Apache Pinot is a high-performance database engineered to serve analytical queries with extremely high concurrency, boasting latencies as low as tens of milliseconds. It excels at ingesting streaming data from sources like Apache Kafka and is optimized for real-time, user-facing analytics applications.

In this full-day training, we will explore the architectures of Apache Kafka, Apache Flink, and Apache Pinot. We will run local clusters of each system, studying the role each plays in a real-time analytics pipeline. Participants will begin by ingesting static data into Pinot and querying it. Not content to stop there, we’ll add a streaming data source in Kafka, and ingest that into Pinot as well, showing how both data sources can work together to enrich an application. We’ll then examine which analytics operations belong in the analytics data store (Pinot) and which ones should be computed before ingestion. These operations will be implemented in Flink. Having put all three technologies to use on your own in hands-on exercises, you’ll leave prepared to begin exploring them together for your own real-time, user-facing analytics applications.

Learning Objectives

At the successful completion of this training, you will be able to:

List the essential components of Pinot, Kafka, and Flink.
Explain the architecture of Apache Pinot and its integration with Apache Kafka and Apache Flink.
Form an opinion about the proper role of Kafka, Flink, and Pinot in a real-time analytics stack.
Implement basic stream processing tasks with Apache Flink.
Create a table in Pinot, including schema definition and table configuration.
Ingest batch data into an offline table and streaming data from a Kafka topic into a real-time table.
Use the Pinot UI to monitor and observe your Pinot cluster.
Use the Pinot Query Console

Prerequisites

To participate in this workshop, you will need the following:

Docker Desktop: We will use Docker to run Pinot, Kafka, and Flink locally. If you need to install it, please download Docker Desktop and follow the instructions to install it at https://www.docker.com/get-started/.
Resources: Pinot works well in Docker but is not designed as a desktop solution. Running it locally requires a minimum of 8GB of Memory and 10GB of disk space.

Training Outline and Timeline

Welcome Breakfast

Duration 7:30 AM - 9:00 AM

Participants arrive and enjoy breakfast
Time for networking and setting up personal workstations, PC laptop uses are finding power sources

Module 1: Introduction to Real-Time Analytics and Overview of Technologies

Duration: 1 hr (9:00 AM - 10:00 AM) Speaker: Viktor

Discuss the concept of real-time analytics
Overview of Apache Kafka, Apache Flink, and Apache Pinot
How these technologies work together in a real-time analytics stack

Module 2: Setting Up Local Clusters with Docker

Duration: 30 min (10:00 AM - 10:30 AM) Speakers: Viktor (Lead), Upkar (TA)

Guide on installing Docker (if not pre-installed)
Setup of Apache Kafka, Apache Flink, and Apache Pinot clusters
- checking internet connection
- pulling images
- smoke test
Ensuring everyone’s local environment is configured correctly

Break

Duration: 30 min (10:30 AM - 11:00 AM)

Module 3: Ingesting Data into Apache Pinot

Duration 1hr (11:00 AM - 12:00 PM) Speaker: Viktor

Creating a schema and configuring a table in Pinot
Ingesting static data into an offline table
Querying data in Apache Pinot

Lunch Break 12:00 PM - 1:00 PM

Module 4: Integrating Kafka with Pinot for Real-Time Data Ingestion

Duration: 1 hr (1:00 PM - 2:00 PM) Speaker: Viktor (Lead), Upkar (TA)

Kafka 101 refresher
Setting up a Kafka topic
Streaming data ingestion from Kafka to a real-time table in Pinot
Using the Pinot UI to monitor and manage the cluster

Break 2:00 PM - 2:30 PM

Module 5: Stream Processing with Apache Flink

Duration: 1hr (2:30 PM - 3:30 PM) Speakers: Upkar (Lead), Viktor (TA)

Basic concepts of stream processing
Implementing stream processing tasks with Apache Flink
Enriching Kafka streams before ingestion into Pinot

Wrap-Up and Q&A

Duration: 30 min (3:30 PM - 4:00 PM) Speakers: Viktor, Upkar

Recap of the day’s lessons
Open floor for questions
Discussion on potential use cases in participants' work

Equipment and Software Check

Ensure all participants have installed Docker Desktop and have the necessary system resources as outlined in the prerequisites.

Let’s Get Going!

Before the Workshop

To ensure you are fully prepared for the workshop, please follow these guidelines:

Version Control:

Check out the latest version of the workshop repository to access all necessary materials and scripts.

git clone https://github.com/gAmUssA/uncorking-analytics-with-pinot-kafka-flink.git
cd uncorking-analytics-with-pinot-kafka-flink

Docker:
- Install Docker if it isn’t already installed on your system. Download it from https://www.docker.com/products/docker-desktop.
- Before the workshop begins, pull the necessary Docker images to ensure you have the latest versions:
  make pull_images
Integrated Development Environment (IDE):
- Install Visual Studio Code (VSCode) to edit and view the workshop materials comfortably. Download VSCode from https://code.visualstudio.com/.
- Add the AsciiDoc extension from the Visual Studio Code marketplace to enhance your experience with AsciiDoc formatted documents.

During the Workshop

Validate Setup:
- Before diving into the workshop exercises, verify that all Docker containers needed for the workshop are running correctly:
  docker ps
- This command helps confirm that there are no unforeseen issues with the Docker containers, ensuring a smooth operation during the workshop.
Using VSCode:
- Open the workshop directory in VSCode to access and edit files easily. Use the AsciiDoc extension to view the formatted documents and instructions:
  code .

Troubleshooting Tips

Docker Issues:
- If Docker containers fail to start or crash, use the following command to inspect the logs and identify potential issues:
  docker logs <container_name>
- This can help in diagnosing problems with specific services.
Network Issues:
- Ensure no applications are blocking the required ports. If ports are in use or blocked, reconfigure the services to use alternative ports or stop the conflicting applications.

Clean Up Post-Workshop

Removing Docker Containers:
- To clean up after the workshop, you might want to remove the Docker containers used during the session to free up resources:
  make stop_containers
- Additionally, prune unused Docker images and volumes to recover disk space:
  docker system prune -a docker volume prune

These steps and tips are designed to prepare you thoroughly for the workshop and to help address common issues that might arise, ensuring a focused and productive learning environment.

Practice Part Overview

The practical exercises of this workshop are divided into three distinct parts, each designed to give you hands-on experience with Apache Pinot’s capabilities in different scenarios. Below are the details and objectives for each part:

01_pinot/README.adoc

02_pinot-kafka/README.adoc

03_pinot-kafka-flink/README.adoc