This project provides several micro benchmarks that measure the sustainable throughput, latency, and CPU, memory & bandwidth consumption of different batch & stream join approaches.
Author: Rafael Moczalla
Create Date: 19 July 2022
Last Update: 26 July 2022
Tested on Ubuntu 22.04 LTS.
- Install git, a Java JDK, Docker & Gradle.
sudo apt install gradle default-jdk-headless docker-ce
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install gradle 7.5
- Install Docker Compose & make the downloaded binary executable.
sudo curl -SL https://github.com/docker/compose/releases/download/v2.6.1/docker-compose-linux-x86_64 -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
- Download the project & change directory to the project folder.
git clone https://github.com/rafaelmoczalla/TBD.git
cd TBD
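Before continuing, it can help to confirm that the tools from the steps above are actually on the PATH. The following is a minimal sketch, not part of the project: the helper name missing_tools is hypothetical, and the list of command names (git, java, docker, docker-compose, gradle) is an assumption derived from the install steps.

```shell
#!/bin/sh
# Print each tool from the argument list that is not found on PATH.
# (missing_tools is a hypothetical helper, not provided by the project.)
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Command names are assumptions derived from the install steps above.
missing=$(missing_tools git java docker docker-compose gradle)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names end up on one line.
    echo "Missing tools:" $missing
else
    echo "All required tools found."
fi
```

If a tool is reported as missing right after installing it via SDKMAN, opening a new terminal or re-sourcing sdkman-init.sh may be needed so the PATH is updated.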
To run the examples, you first need a running Spark cluster to which you can submit the map reduce job. Then you build the project & afterwards submit one of the two join examples as a job to the cluster.
We use the Docker Spark cluster setup provided in the ./environment subproject. To start the local cluster, open a new terminal in the ./environment folder & start the cluster with Docker Compose as follows:
cd environment
gradle clean && gradle build
make startCluster
Make sure the gradle.properties files in both projects are identical.
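One way to check this is a byte-wise comparison with cmp. This is a sketch under assumptions: the helper name same_properties is hypothetical, and the two paths assume the script runs from the root project with the second file located in the ./environment subproject.

```shell
#!/bin/sh
# Return 0 only when both files exist and have byte-identical content.
# (same_properties is a hypothetical helper, not provided by the project.)
same_properties() {
    [ -f "$1" ] && [ -f "$2" ] && cmp -s "$1" "$2"
}

# Assumed locations: the root project and the ./environment subproject.
if same_properties gradle.properties environment/gradle.properties; then
    echo "gradle.properties files are identical."
else
    echo "gradle.properties files differ or are missing."
fi
```

When the files differ, running diff on the same two paths shows which properties need to be aligned.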
The project is built with Gradle & split into a source subproject & the actual join subproject. When you first set up the project, or when some files are missing, you need to run the following command in the project directory.
gradle build
When you change the configuration in any of the gradle.properties files, or add a new template file via the build.gradle file, you need to do a clean rebuild of the project with the following command.
gradle clean && gradle build
Be careful, as all files generated from template files are deleted & regenerated.
To build only the sources enter
gradle :source:build
into the terminal & to build only the join job enter
gradle :distributed-join:build
into the terminal.
Before starting the actual micro benchmark, we need to start the sources. We prepared a make target for that task.
make startSources
After starting the sources we can submit & start the join with
make submitJob
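The whole sequence described above (cluster, build, sources, job) can be sketched as one script. Several things here are assumptions rather than documented behavior: that make startCluster returns once the cluster is up, that make startSources and make submitJob are run from the project root, and the helper names run and benchmark_steps, which are hypothetical. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of the end-to-end order described above.
# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

# Print a command and, unless in dry-run mode, execute it.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Run the steps in order and stop at the first failing step.
benchmark_steps() {
    # 1. Build and start the Spark cluster from the ./environment subproject
    #    (assumes startCluster returns once the cluster is running).
    run sh -c 'cd environment && gradle clean && gradle build && make startCluster' &&
    # 2. Build the source and join subprojects.
    run gradle build &&
    # 3. Start the sources, then submit the join job
    #    (assumed to be run from the project root).
    run make startSources &&
    run make submitJob
}

benchmark_steps
```

Chaining the steps with && means a failed build does not go on to submit a job against a half-built project.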
- Implement basic source.
- Implement basic join.
- Add a "measuring" subproject providing tools for measuring sustainable throughput & latency.
- Implement a first micro benchmark.