Experiments with Hierarchical Data Distribution with Kafka
Ideally, the experiments are executed on two hosts:
- host1 runs the Kafka cluster
- host2 runs the client
┌───────────────┐          ┌───────────────┐
│               │          │               │
│     host1     │   ssh    │     host2     │
│               │◄─────────┤               │
│ Kafka cluster │          │ Kafka clients │
│               │          │               │
└───────────────┘          └───────────────┘
The experiments are started on host2.
When executing the scripts, the address of host1 is specified via the environment variable REMOTEHOST.
Both hosts are required to have a working Internet connection (host1: to download the Kafka/ZooKeeper Docker images; host2: to clone this git repo).
On host1:
- Install Docker Compose, e.g., with Ubuntu:
sudo apt update && sudo apt install -y docker-compose
- Enable root access via SSH from host2. This can be done by creating a new SSH key with no passphrase (ssh-keygen -b 2048 -t rsa -f newkey -q -N "") and then copying newkey.pub (public part) to /root/.ssh/authorized_keys on host1, while associating newkey (private part) with host1 on host2.
- Make sure that host2 can create enough SSH connections to host1. This depends on the configuration of your system. Tips: raise MaxSessions in /etc/ssh/sshd_config and disable/fine-tune PAM and fail2ban (if installed).
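The SSH setup above could be sketched as follows. The helper names ssh_config_stanza and count_parallel_ok are hypothetical and not part of this repo. The first prints an ~/.ssh/config entry that associates the private key with host1 (to be appended on host2); the second runs a command several times in parallel and counts the successes, which is one way to probe whether host1 accepts enough concurrent SSH connections. How many connections the experiments actually need is not stated here, so the value 20 in the usage comment is only a placeholder.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from the repository.

# Print an ~/.ssh/config stanza that makes ssh use the given private key
# when logging in as root on the given host.
ssh_config_stanza() {
    host=$1
    keyfile=$2
    printf 'Host %s\n    User root\n    IdentityFile %s\n    IdentitiesOnly yes\n' \
        "$host" "$keyfile"
}

# Run the given command N times in parallel and print how many runs succeeded.
count_parallel_ok() {
    n=$1
    shift
    pids=""
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1 &
        pids="$pids $!"
        i=$((i + 1))
    done
    ok=0
    for pid in $pids; do
        if wait "$pid"; then
            ok=$((ok + 1))
        fi
    done
    echo "$ok"
}

# Example usage on host2 (assumes REMOTEHOST is exported and newkey is in $PWD):
#   ssh_config_stanza "$REMOTEHOST" "$PWD/newkey" >> ~/.ssh/config
#   count_parallel_ok 20 ssh -o BatchMode=yes "root@$REMOTEHOST" true
```

If the printed count is lower than the number of attempts, some connections were refused or timed out, and the sshd, PAM, or fail2ban settings on host1 are worth revisiting.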
On host2:
- Clone this repo:
git clone https://github.com/ccicconetti/kafka-hdd.git
- Download the Kafka binaries:
wget -O- https://dlcdn.apache.org/kafka/3.4.1/kafka_2.13-3.4.1.tgz | tar xfz -
- Export the environment variables KAFKA_DIR and REMOTEHOST, which are used by some of the scripts (for the latter use the real IP address of host1):
export KAFKA_DIR=$PWD/kafka_2.13-3.4.1
export REMOTEHOST=1.2.3.4
- Install kafkacat:
sudo apt update && sudo apt install -y kafkacat
- Verify that all the requirements are met with:
scripts/check_reqs.sh
- Make sure python3 and python2 are installed. The former is used by the scripts bundled in this repo, while the latter is used by a script downloaded on demand only for the post-processing of the results.
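As a quick sanity check of the host2 setup, the steps above could be verified with a sketch like the one below. It is not a replacement for scripts/check_reqs.sh, whose exact checks are not described here; the function names check_env and missing_tools are hypothetical. The sketch assumes that KAFKA_DIR points to the unpacked Kafka distribution, which contains a bin/ subdirectory.

```shell
#!/bin/sh
# Hypothetical pre-flight check; it does not replace scripts/check_reqs.sh.

# Return 0 if KAFKA_DIR points to a directory containing bin/ and
# REMOTEHOST is non-empty; otherwise report what is missing and return 1.
check_env() {
    status=0
    if [ -z "${KAFKA_DIR:-}" ] || [ ! -d "$KAFKA_DIR/bin" ]; then
        echo "KAFKA_DIR is unset or has no bin/ subdirectory" >&2
        status=1
    fi
    if [ -z "${REMOTEHOST:-}" ]; then
        echo "REMOTEHOST is unset" >&2
        status=1
    fi
    return "$status"
}

# Print, one per line, the names of the given commands not found in PATH.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Example usage on host2:
#   check_env && missing_tools kafkacat python3 python2 gnuplot
```

An empty output from missing_tools means that every listed command was found.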
You can replicate the calibration experiments as follows:
cd graphs
python3 ../allocation/calibration.py
This will produce a number of *.dat files, which can be plotted with Gnuplot:
gnuplot -persist calibration-P.plt
gnuplot -persist calibration-b.plt
If you use this software in a scientific publication, please cite the following work:
Theofanis P. Raptis, Claudio Cicconetti, Andrea Passarella,
Efficient topic partitioning of Apache Kafka for high-reliability real-time data streaming applications,
Future Generation Computer Systems,
Volume 154, 2024, Pages 173-188,
https://doi.org/10.1016/j.future.2023.12.028.