Table of Contents

  1. Environment Setup
  2. Warmup - Connecting to Pulsar
  3. Handling Multiple Streams
  4. Warmup - Keyed State
  5. Performing Data Enrichment and Lookups
  6. Buffering
  7. Side Outputs
  8. State Backends
  9. Checkpoints, Savepoints and Restart Strategies
  10. Additional Resources

Environment Setup

In order to run the code samples, we need Pulsar and Flink clusters up and running. You can also run the Flink examples from within your favorite IDE, in which case you don't need a Flink cluster.

If you want to run the examples inside a Flink cluster, run the following command to start both the Pulsar and Flink clusters:

docker-compose up

If you want to run the examples within your IDE, you can spin up just a Pulsar cluster:

docker-compose up pulsar

When the cluster is up and running, run the following command:

./setup.sh

Warmup - Connecting to Pulsar

Outcomes: Learn how to connect to Pulsar and start consuming events. We will see how to achieve this by using:

  1. The DataStream API
  2. The Flink SQL API
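
A minimal sketch of the DataStream approach, using the Flink Pulsar connector (flink-connector-pulsar). The service URL, admin URL, topic, and subscription name below are assumptions matching a local docker-compose setup; depending on your connector version you may need to wrap the schema with PulsarDeserializationSchema.flinkSchema(...).

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.pulsar.source.PulsarSource;
import org.apache.flink.connector.pulsar.source.enumerator.cursor.StartCursor;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PulsarWarmup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Build a Pulsar source; URLs, topic, and subscription are assumptions.
        PulsarSource<String> source = PulsarSource.builder()
                .setServiceUrl("pulsar://localhost:6650")
                .setAdminUrl("http://localhost:8080")
                .setStartCursor(StartCursor.earliest())
                .setTopics("transactions")
                .setDeserializationSchema(new SimpleStringSchema())
                .setSubscriptionName("flink-warmup")
                .build();

        // Consume the topic as a DataStream and print each event.
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Pulsar Source")
           .print();

        env.execute("Pulsar Warmup");
    }
}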

Handling Multiple Streams

Use Case:
In some scenarios you might need to merge multiple streams together, for example data from two Pulsar topics.
The input topics' events can have either the same or different schemas. When the input events have the same schema you can use the union function; otherwise you can use the connect function.

Outcome:
We will see how to achieve this using two input transaction topics - one containing credit transactions and one containing debits - and merge these two streams into a single DataStream. We will see two different approaches, union and connect, sketched below.
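
Here is a minimal, self-contained sketch of both approaches; the element types and inline values are made up for illustration:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoMapFunction;

public class MergeStreamsExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Two streams with the SAME type: union simply merges them.
        DataStream<String> credits = env.fromElements("credit:10", "credit:25");
        DataStream<String> debits  = env.fromElements("debit:5", "debit:40");
        credits.union(debits).print("union");

        // Two streams with DIFFERENT types: connect keeps both types
        // and a CoMapFunction maps each side to a common output type.
        DataStream<Integer> creditAmounts = env.fromElements(10, 25);
        DataStream<Double>  debitAmounts  = env.fromElements(5.0, 40.0);
        creditAmounts
            .connect(debitAmounts)
            .map(new CoMapFunction<Integer, Double, String>() {
                @Override
                public String map1(Integer credit) { return "credit:" + credit; }
                @Override
                public String map2(Double debit)   { return "debit:" + debit; }
            })
            .print("connect");

        env.execute("Merge Streams Example");
    }
}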

Other options to explore:

  • Window Joins

    • the events need to belong to the same window and match on some join condition
  • Interval Joins

    • the difference between the two events' timestamps needs to fall within a bounded interval:

    lowerBoundInterval < timeA - timeB < upperBoundInterval
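
For reference, a sketch of an interval join, assuming hypothetical DataStream<Transaction> streams (credits and debits) with public customerId and amount fields, and event-time timestamps/watermarks already assigned:

import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

// Match each credit with any debit for the same customer that occurs
// within 5 minutes before or after it.
credits.keyBy(t -> t.customerId)
    .intervalJoin(debits.keyBy(t -> t.customerId))
    .between(Time.minutes(-5), Time.minutes(5))
    .process(new ProcessJoinFunction<Transaction, Transaction, String>() {
        @Override
        public void processElement(Transaction credit, Transaction debit,
                                   Context ctx, Collector<String> out) {
            out.collect(credit.customerId + ": credit " + credit.amount
                    + " matched debit " + debit.amount);
        }
    });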

Warmup - Keyed State

Outcomes:

  1. Understand the difference between Operator State and Keyed State
  2. How Flink handles state and the available State primitives

We will work through an example that accumulates state per customer key and answers the question: how many transactions do we have per customer?
You can find the relevant examples here.
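
A minimal, self-contained sketch of the pattern, using ValueState inside a KeyedProcessFunction; the inline customer ids stand in for real transaction events consumed from Pulsar:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class TransactionsPerCustomer {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Each element is a customer id; in the real examples this would
        // be a transaction event consumed from Pulsar.
        DataStream<String> transactions = env.fromElements("alice", "bob", "alice", "alice");

        transactions
            .keyBy(customerId -> customerId)
            .process(new KeyedProcessFunction<String, String, Tuple2<String, Long>>() {
                private transient ValueState<Long> countState;

                @Override
                public void open(Configuration parameters) {
                    // One counter per customer key, managed by Flink.
                    countState = getRuntimeContext().getState(
                            new ValueStateDescriptor<>("txn-count", Types.LONG));
                }

                @Override
                public void processElement(String customerId, Context ctx,
                                           Collector<Tuple2<String, Long>> out) throws Exception {
                    Long count = countState.value();
                    long updated = (count == null ? 0L : count) + 1;
                    countState.update(updated);
                    out.collect(Tuple2.of(customerId, updated));
                }
            })
            .print();

        env.execute("Transactions per Customer");
    }
}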

Performing Data Enrichment and Lookups

Use Case:
In some scenarios we need to perform data enrichment and/or data lookups, having one topic as an append-only stream while the other topic is a changelog stream - one that keeps only the latest state for a particular key.
We will see how to implement such use cases by combining Flink's process function and keyed state to enrich transaction events with user information, as sketched below.
You can find the relevant examples here.
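
A minimal sketch of the pattern, assuming hypothetical Transaction and User types: the changelog side overwrites keyed state with the latest user record, and the append-only side reads it to enrich each transaction.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical event types for illustration.
class Transaction { public String userId; public double amount; }
class User { public String userId; public String name; }

public class EnrichmentFunction
        extends KeyedCoProcessFunction<String, Transaction, User, String> {

    private transient ValueState<User> userState;

    @Override
    public void open(Configuration parameters) {
        // Changelog semantics: keep only the latest User record per key.
        userState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("user", User.class));
    }

    @Override
    public void processElement1(Transaction txn, Context ctx,
                                Collector<String> out) throws Exception {
        User user = userState.value();
        if (user != null) {
            // Enrich the transaction with the latest user information.
            out.collect(user.name + " spent " + txn.amount);
        }
        // Missing user info is covered in the Buffering and Side Outputs sections.
    }

    @Override
    public void processElement2(User user, Context ctx, Collector<String> out) {
        userState.update(user);
    }
}

Wiring it up looks like: transactions.connect(users).keyBy(t -> t.userId, u -> u.userId).process(new EnrichmentFunction()).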

Buffering

Use Case: This scenario is similar to the previous one, but here we buffer events when the matching state is missing.
We cannot guarantee when both matching events will arrive in our system, and in such cases we might want to buffer an event until its match arrives, as sketched below.
You can find the relevant examples here.
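
A minimal sketch extending the enrichment function above with a ListState buffer for transactions that arrive before their user record (names are assumptions):

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;

// Additional state, declared alongside userState and initialized in open():
//     bufferedTxns = getRuntimeContext().getListState(
//             new ListStateDescriptor<>("buffered-txns", Transaction.class));
private transient ListState<Transaction> bufferedTxns;

@Override
public void processElement1(Transaction txn, Context ctx,
                            Collector<String> out) throws Exception {
    User user = userState.value();
    if (user != null) {
        out.collect(user.name + " spent " + txn.amount);
    } else {
        // No user info yet: buffer the transaction until it arrives.
        bufferedTxns.add(txn);
    }
}

@Override
public void processElement2(User user, Context ctx,
                            Collector<String> out) throws Exception {
    userState.update(user);
    // Flush any transactions that arrived before the user record.
    for (Transaction txn : bufferedTxns.get()) {
        out.collect(user.name + " spent " + txn.amount);
    }
    bufferedTxns.clear();
}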

Side Outputs

Use Case:
In some scenarios you might want to split one stream into multiple streams - stream branches - and forward them to different downstream operators for further processing.
We will see how to implement such use cases in order to account for unhappy paths. Following our Data Enrichment example, we will see how to handle transaction events for which user information is not present (maybe due to some human error or system malfunction), as sketched below.
You can find the relevant examples here.
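
A minimal sketch using an OutputTag to route transactions with missing user info to a side output; the tag name and wiring are assumptions built on the enrichment example above:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.util.OutputTag;

// Anonymous subclass so Flink can capture the element type.
final OutputTag<Transaction> missingUserTag =
        new OutputTag<Transaction>("missing-user-info") {};

// Inside processElement1 of the enrichment function, instead of dropping
// (or buffering) an unmatched transaction, route it to the side output:
//     ctx.output(missingUserTag, txn);

SingleOutputStreamOperator<String> enriched = transactions
        .connect(users)
        .keyBy(t -> t.userId, u -> u.userId)
        .process(new EnrichmentFunction());

// The unhappy path becomes its own stream for separate downstream handling.
DataStream<Transaction> missingUserInfo = enriched.getSideOutput(missingUserTag);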

State Backends

Outcomes:

  • Understand Flink state backends
  • Use RocksDB as the state backend to handle state that can't fit in memory
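
A minimal sketch of enabling RocksDB programmatically; it requires the flink-statebackend-rocksdb dependency, and the checkpoint path is an assumption. The same can be configured in flink-conf.yaml via state.backend: rocksdb.

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Keep working state in RocksDB on local disk so it can grow beyond memory;
// 'true' enables incremental checkpoints.
env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

// Checkpoints still need a durable location; the path is an assumption.
env.getCheckpointConfig().setCheckpointStorage("file:///tmp/flink-checkpoints");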

Checkpoints, Savepoints and Restart Strategies

Outcomes:

  • Understand how Flink's checkpointing mechanism works
  • Understand how we can combine Flink checkpoints and restart strategies to recover our job
  • Checkpoints vs Savepoints
  • How we can recover state from a previous checkpoint

You can find the relevant examples here.
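
A minimal sketch of enabling checkpointing together with a fixed-delay restart strategy; the interval and retry counts are assumptions:

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Snapshot all operator state every 10 seconds with exactly-once guarantees.
env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

// On failure, restart the job up to 3 times, waiting 10 seconds in between;
// state is restored from the latest completed checkpoint on each restart.
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));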

Additional Resources