- Created by LinkedIn, now an open-source project mainly maintained by Confluent
- Distributed, resilient architecture, fault tolerant
- Horizontal scalability:
- Can scale to 100s of brokers
- Can scale to millions of messages per second
- High performance (latency of less than 10 ms) - real time
- Used by 2000+ firms, 35% of the Fortune 500
- Messaging system
- Activity Tracking
- Gather metrics from many different locations
- Application logs gathering
- Stream processing (with the Kafka Streams API or Spark for example)
- De-coupling of system dependencies
- Integration with Spark, Flink, Storm, Hadoop, and many other Big Data technologies
- Netflix uses Kafka to apply recommendations in real time while you're watching TV shows
- Uber uses Kafka to gather taxi, user, and trip data in real time to compute and forecast demand, and to compute surge pricing in real time
- LinkedIn uses Kafka to prevent spam and to collect user interactions to make better connection recommendations in real time
- Remember that Kafka is only used as a transportation mechanism
- Topics: a particular stream of data
- Similar to a table in a database (without all the constraints)
- You can have as many topics as you want
- A topic is identified by its name
- Topics are split into partitions
- Each partition is ordered
- Each message within a partition gets an incremental id, called an offset
- Offsets only have meaning within a specific partition
- E.g. offset 3 in partition 0 doesn't represent the same data as offset 3 in partition 1
- Order is guaranteed only within a partition (not across partitions)
- Data is kept only for a limited time (default is one week)
- Once the data is written to a partition, it can't be changed (immutability)
- Data is assigned randomly to a partition unless a key is provided
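For example, a topic with 3 partitions could be created with the topics tool (a minimal sketch assuming the local Kafka 2.0 setup described at the end of these notes, with Zookeeper on its default port 2181; first_topic is just a placeholder name, and the replication factor is 1 only because that setup runs a single broker):

```
kafka-topics.bat --zookeeper localhost:2181 --create --topic first_topic --partitions 3 --replication-factor 1
```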
- A Kafka cluster is composed of multiple brokers (servers)
- Each broker is identified with its ID (integer)
- Each broker contains certain topic partitions
- After connecting to any broker (called a bootstrap broker), you will be connected to the entire cluster
- A good number to get started is 3 brokers, but some big clusters have over 100 brokers
- When we create a topic, Kafka will automatically assign its partitions and distribute them across all your brokers
- Topics should have a replication factor > 1 (usually between 2 and 3)
- This way, if a broker is down, another broker can still serve the data
- The replication factor should be less than or equal to the total number of brokers
- At any time only ONE broker can be a leader for a given partition
- Only that leader can receive and serve data for a partition
- The other brokers will synchronize the data
- Therefore each partition has one leader and multiple ISRs (in-sync replicas)
- Zookeeper decides which broker is the leader and which replicas are the ISRs
- If the broker that is the leader for a partition goes down, the same partition on another (in-sync) broker becomes the new leader
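To check which broker currently leads each partition and which replicas are in sync, the same topics tool can describe a topic (a sketch assuming the same local setup as above; the output lists Leader, Replicas, and Isr for every partition):

```
kafka-topics.bat --zookeeper localhost:2181 --describe --topic first_topic
```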
- Producers write data to topics (which are made of partitions)
- Producers automatically know which broker and partition to write to
- In case of broker failures, producers will automatically recover
- The load is balanced across many brokers thanks to the number of partitions
- Producers can choose to receive acknowledgement of data writes
- acks=0: Producer won't wait for acknowledgement (possible data loss)
- acks=1: Producer will wait for leader acknowledgement (limited data loss)
- acks=all: Leader + replicas acknowledgement (no data loss)
- Producers can choose to send a key with the message (string, number, etc..)
- If key=null, data is sent round robin (broker 101 then 102 then 103...)
- If a key is sent, then all messages for that key will always go to the same partition
- A key is typically sent when you need message ordering for a specific field
- Thanks to key hashing, all messages carrying the same key are guaranteed to go to the same partition (see the producer sketch below)
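A minimal Java producer sketch tying these points together (assumes a broker reachable at localhost:9092; the topic first_topic and the key truck_id_123 are placeholders, not part of the notes above):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // bootstrap broker (placeholder)
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // acks=all: leader + replicas acknowledgement

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // With a key, every message for "truck_id_123" lands on the same partition (key hashing);
        // with a null key, records would be spread round-robin instead
        producer.send(new ProducerRecord<>("first_topic", "truck_id_123", "hello kafka"));

        producer.flush(); // make sure the record is actually sent before exiting
        producer.close();
    }
}
```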
- Consumers read data from a topic (identified by name)
- Consumers know which broker to read from
- In case of broker failures, consumers know how to recover
- Data is read in order within each partition
- Consumers read data in consumer groups
- Each consumer within a group reads from exclusive partitions
- If you have more consumers than partitions, some consumers will be inactive
- Consumers automatically use a GroupCoordinator and a ConsumerCoordinator to assign each consumer to a partition
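A minimal Java consumer sketch for the concepts above (same assumed local broker and placeholder topic as in the producer sketch; the group id my-group is also made up):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group"); // consumers sharing this id split the partitions
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // where to start with no committed offset

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("first_topic"));

        while (true) { // demo loop; a real app would handle shutdown and close the consumer
            // poll returns records from the partitions assigned to this consumer, in order per partition
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                        record.partition(), record.offset(), record.key(), record.value());
            }
        }
    }
}
```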
- Kafka stores the offsets at which a consumer group has been reading
- The committed offsets live in an internal Kafka topic named __consumer_offsets
- When a consumer in a group has processed data received from Kafka, it should be committing the offsets
- If a consumer dies, it will be able to read back from where it left off thanks to committed consumer offsets
- Consumers choose when to commit offsets
- There are 3 delivery semantics:
- At most once: offsets are committed as soon as the message is received. If the processing goes wrong, the message will be lost (it won't be read again)
- At least once (usually preferred): offsets are committed after the message is processed. If the processing goes wrong, the message will be read again. This can result in duplicate processing of messages, so make sure your processing is idempotent (i.e. processing the messages again won't impact your systems); see the commit sketch below
- Exactly once: can be achieved for Kafka => Kafka workflows using the Kafka Streams API; for Kafka => external systems workflows, use an idempotent consumer
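For at-least-once semantics, a common pattern is to disable auto-commit and commit offsets only after processing has succeeded. A sketch of just the relevant lines, building on the consumer sketch above (processRecord is a hypothetical placeholder for your own logic, not a Kafka API):

```java
// At-least-once: disable auto-commit so offsets are committed only after processing
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processRecord(record); // hypothetical; must be idempotent, since it may run twice for the same record
    }
    if (!records.isEmpty()) {
        consumer.commitSync(); // committed only after the whole batch was processed
    }
}
```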
- Every Kafka broker is also called a bootstrap server
- That means that you only need to connect to one broker and you will be connected to the entire cluster.
- Each broker knows about all the brokers, topics and partitions (metadata)
- Zookeeper manages brokers (keeps a list of them)
- Zookeeper helps in performing leader election for partitions
- Zookeeper sends notifications to Kafka in case of changes (e.g. new topic, broker dies, broker comes up, delete topics, etc.)
- Kafka can't work without Zookeeper
- Zookeeper by design operates with an odd number of servers (3, 5, 7)
- Zookeeper has a leader (handles writes); the rest of the servers are followers (handle reads)
- (Zookeeper doesn't store consumer offsets with Kafka > v0.10)
- Messages are appended to a topic-partition in the order they are sent
- Consumers read messages in the order stored in a topic-partition
- With a replication factor of N, producers and consumers can tolerate up to N-1 brokers being down
- As long as the number of partitions remains constant for a topic (no new partitions), the same key will always go to the same partition
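To see why the last point holds: the producer maps a key to a partition by hashing the key bytes and taking the result modulo the number of partitions, so adding partitions changes the mapping. A toy illustration of the mechanism (Kafka's real DefaultPartitioner uses a murmur2 hash, not Arrays.hashCode, so the actual partition numbers will differ):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class KeyHashingDemo {
    // Toy stand-in for Kafka's DefaultPartitioner: hash(keyBytes) mod numPartitions
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // Same key + same partition count => always the same partition
        System.out.println(partitionFor("truck_id_123", 3));
        System.out.println(partitionFor("truck_id_123", 3)); // identical result
        // A different partition count can send the same key elsewhere
        System.out.println(partitionFor("truck_id_123", 4));
    }
}
```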
- Download and Setup Java 8 JDK
- Download the Kafka binaries from https://kafka.apache.org/downloads
- Extract Kafka at the root of C:\
- Setup Kafka bins in the Environment variables section by editing Path
- Try Kafka commands using kafka-topics.bat (for example)
- Edit Zookeeper & Kafka configs using NotePad++ https://notepad-plus-plus.org/download/
- zookeeper.properties: dataDir=C:/kafka_2.12-2.0.0/data/zookeeper (yes, the slashes are inverted)
- server.properties: log.dirs=C:/kafka_2.12-2.0.0/data/kafka (yes, the slashes are inverted)
- Start Zookeeper in one command line:
zookeeper-server-start.bat config\zookeeper.properties
- Start Kafka in another command line:
kafka-server-start.bat config\server.properties
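Once both are running, the setup can be smoke-tested with the console clients (assuming a topic named first_topic has already been created, and the broker listens on the default localhost:9092):

```
kafka-console-producer.bat --broker-list localhost:9092 --topic first_topic
kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic first_topic --from-beginning
```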
Important: for the rest of the course, don't forget to add the .bat extension to the commands being run.
- Windows users: do not delete topics. Windows has a long-standing bug (KAFKA-1194) which makes Kafka crash if you delete topics; the only way to recover from this error is to manually delete the folders in data/kafka
- Kafka Tool, a UI that can be used in place of the Kafka CLI, can be downloaded from http://kafkatool.com/