- A recent version of Confluent Platform installed
- Linux or macOS operating system
- The bash and awk command line tools
- A recent version of Java
- Start up the Zookeeper hierarchical quorum, each process in a different terminal window.
As long as not all processes are running, you will see exceptions in the logs; this is expected.
Once all processes are running, no additional exceptions should be logged.
./start-zk-1.sh
./start-zk-2.sh
./start-zk-3.sh
./start-zk-4.sh
./start-zk-5.sh
./start-zk-6.sh
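For orientation, a hierarchical quorum over two data centers is declared with `group` and `weight` entries in each node's zoo.cfg. The fragment below is a sketch; the hostnames, ports, and group assignment are assumptions for illustration, not the repository's actual configuration:

```properties
# zoo.cfg fragment (sketch) -- hierarchical quorum over two data centers
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
server.4=zk4:2888:3888
server.5=zk5:2888:3888
server.6=zk6:2888:3888
# Group 1 in data center A, group 2 in data center B
group.1=1:2:3
group.2=4:5:6
# Equal voting weight for every server
weight.1=1
weight.2=1
weight.3=1
weight.4=1
weight.5=1
weight.6=1
```

With two groups, the quorum requires a majority of servers in a majority of groups, which is why the cluster later tolerates one outage per group.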
- Inspect the Zookeeper cluster status:
./get-zookeeper-cluster-status.sh
You should see that all Zookeeper nodes are running, with exactly one leader and five followers.
- Kill Zookeeper1 and Zookeeper4 to simulate machine failures:
kill -9 <PROCESS_ID>
Inspect the Zookeeper cluster status again and verify that four Zookeepers are running, with exactly one leader and three followers. The cluster is still operational, since it tolerates one outage per Zookeeper group.
- Restart Zookeeper1 and Zookeeper4. The cluster is now back in its normal state.
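A status check like get-zookeeper-cluster-status.sh can be approximated with ZooKeeper's `srvr` four-letter command (which may need to be whitelisted via `4lw.commands.whitelist` on recent ZooKeeper versions). The client ports below and the `parse_mode` helper are assumptions for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a cluster status check; client ports are assumptions.
# parse_mode extracts the "Mode:" line (leader/follower) from "srvr" output.
parse_mode() { awk '/^Mode:/ {print $2}'; }

for port in 2181 2182 2183 2184 2185 2186; do
  mode=$(echo srvr | nc -w 2 localhost "$port" 2>/dev/null | parse_mode)
  echo "zookeeper on port $port: ${mode:-down}"
done
```

A node that is down produces no `Mode:` line, so it is reported as `down`.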
- Start the Kafka brokers in four additional terminal windows.
You will see exceptions in the logs as long as not all brokers are running.
Once all brokers are up, no additional exceptions should be logged.
You now have a fully functional cluster up and running that simulates a stretched cluster over two data centers.
- ./start-k1.sh
- ./start-k2.sh
- ./start-k3.sh
- ./start-k4.sh
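A broker configuration for this stretched setup might look roughly like the following server.properties fragment; the broker id, rack label, and ports are assumptions for illustration, not the contents of the actual start scripts:

```properties
# server.properties fragment (sketch) for broker 1 in data center A
broker.id=1
# broker.rack lets Kafka spread partition replicas across the two data centers
broker.rack=dcA
listeners=PLAINTEXT://localhost:9092
# all six Zookeeper nodes of the hierarchical quorum
zookeeper.connect=localhost:2181,localhost:2182,localhost:2183,localhost:2184,localhost:2185,localhost:2186
```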
- Start a sample producer and a sample consumer application in two more terminal windows.
Note that the producer is configured with the acks=all setting. This will create the topics test and __consumer_offsets.
./run-producer.sh
./run-consumer.sh
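The acks=all behaviour can be reproduced with the stock console producer. The flags below are a sketch rather than the contents of run-producer.sh; the topic name and bootstrap address come from this walkthrough. With acks=all, the broker acknowledges a write only after all in-sync replicas have it, which is what min.insync.replicas is later checked against:

```shell
#!/usr/bin/env bash
# Sketch of a producer invocation with acks=all (not the actual run-producer.sh).
cmd=(kafka-console-producer --bootstrap-server localhost:9092 \
    --topic test --producer-property acks=all)
if command -v kafka-console-producer >/dev/null; then
  "${cmd[@]}" </dev/null
else
  echo "would run: ${cmd[*]}"
fi
```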
- Inspect the newly created topics:
./describe-topics.sh
Note that both topics have a replication factor of 4 and a min.insync.replicas configuration of 3.
- Kill one broker to simulate a machine failure:
kill -9 <PROCESS_ID>
Note that the producer and consumer continue operating without errors.
- Kill another broker to simulate a second machine failure. You should see that the producer is no longer able to produce, since the min.insync.replicas setting is no longer fulfilled. This shows that the setup tolerates the outage of one broker, but not two.
- Restart the two brokers to bring the cluster back to normal mode.
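The failure tolerance seen here is simple arithmetic: with acks=all, produce requests succeed as long as at least min.insync.replicas replicas are alive, so the number of broker outages a topic survives is the replication factor minus min.insync.replicas. A tiny helper (hypothetical, for illustration) makes this concrete:

```shell
# tolerated: broker failures a topic survives for acks=all producers.
# Arguments: replication factor, min.insync.replicas
tolerated() { echo $(( $1 - $2 )); }

tolerated 4 3   # normal mode: prints 1, one broker may fail
tolerated 4 2   # reduced mode: prints 2, two brokers may fail
```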
- Kill all processes located in data center B -- these are Zookeeper 4, 5, and 6, as well as Kafka 3 and 4:
./kill-dcB.sh
Notice that neither the producer nor the consumer is working.
- Bring Zookeeper to reduced operational mode:
- Stop all remaining Zookeeper nodes.
- Start Zookeeper in simple quorum mode:
- ./start-zk-1-degraded.sh
- ./start-zk-2-degraded.sh
- ./start-zk-3-degraded.sh
- Bring Kafka to reduced operational mode. Reduce the min.insync.replicas setting for the remaining brokers:
./reduce-min-isr-dc-A.sh
Note that the producer and consumer are now functional again. In this reduced operational mode we can still tolerate the outage of another Kafka broker and another Zookeeper node.
- Bring Zookeeper 4 to 6 and Kafka 3 and 4 back online. Inspect the Zookeeper cluster status and note that only Zookeeper 1 to 3 are currently serving requests.
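A script such as reduce-min-isr-dc-A.sh can be built on the standard kafka-configs tool. The sketch below is a guess at its shape; the topic name, bootstrap address, and target value of 2 are assumptions based on this walkthrough:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: lower min.insync.replicas so the two surviving
# DC-A brokers can again satisfy acks=all produce requests.
cmd=(kafka-configs --bootstrap-server localhost:9092 \
    --alter --entity-type topics --entity-name test \
    --add-config min.insync.replicas=2)
if command -v kafka-configs >/dev/null; then
  "${cmd[@]}"
else
  echo "would run: ${cmd[*]}"
fi
```

The matching increase script would use --add-config min.insync.replicas=3 instead.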
- Increase the min.insync.replicas setting back to 3:
./increase-min-isr-dc-A.sh
In production you should only do this once there are no more under-replicated partitions; this check should be automated. Note that this does not impact the producer or consumer, since there are sufficient in-sync replicas present.
- Bring Zookeeper back to the hierarchical quorum mode by restarting Zookeeper 1 to 3. This will not impact the producer or consumer.
./start-zk-1.sh
./start-zk-2.sh
./start-zk-3.sh
- Check the Zookeeper cluster status to see that all nodes are back serving requests.
- Stop all processes and run the cleanup script:
cleanup.sh