1brc_streaming

1brc challenge with streaming solutions for Apache Kafka

Inspired by the original 1brc challenge created by Gunnar Morling: https://www.morling.dev/blog/one-billion-row-challenge

⚠️ This is still a WIP project

⚠️ This challenge does not aim to be competitive with the original challenge. It is dedicated to streaming technologies that integrate with Apache Kafka, and results will be evaluated using completely different measures.

Prerequisites

  • Docker Engine and Docker Compose
  • about XXGB of free space
  • the challenge will run only on these supported architectures:
    • Linux - x86_64
    • Darwin (Mac) - x86_64 and arm
    • Windows

Simulation Environment

  • Kafka cluster with 3 brokers. The cluster must be local only. Reserve approximately XXGB for data.
  • Input topic named data with 32 partitions, replication factor 3 and LogAppendTime
  • Output topic named results with 32 partitions and replication factor 3
  • The Kafka cluster must be started with the script run/bootstrap.sh from this repository. The bootstrap script also creates the input and output topics (their expected configuration is sketched after this list).
  • Brokers will listen on ports 9092, 9093 and 9094. No authentication, no SSL.
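
💡 The bootstrap script already creates both topics; the sketch below only illustrates, with Kafka's Java AdminClient, the topic configuration described in the list above. Topic names, partition counts, replication factor and LogAppendTime come from that list; the class name and bootstrap servers string are assumptions for illustration.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateTopicsSketch {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092,localhost:9093,localhost:9094");

        try (AdminClient admin = AdminClient.create(props)) {
            // Input topic: 32 partitions, replication factor 3, broker-side LogAppendTime timestamps.
            NewTopic data = new NewTopic("data", 32, (short) 3)
                    .configs(Map.of("message.timestamp.type", "LogAppendTime"));
            // Output topic: 32 partitions, replication factor 3.
            NewTopic results = new NewTopic("results", 32, (short) 3);

            admin.createTopics(List.of(data, results)).all().get();
        }
    }
}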

Rules

  • Implement a solution with the Kafka APIs, Kafka Streams, Flink, ksqlDB, Spark, NiFi, Camel Kafka, Spring Kafka, etc., reading input data from the data topic and sinking results to the results topic, and run it! This is not limited to Java!

  • Ingest data into a kafka topic:

    • Create 10 CSV files using the script run/data.sh or run/windows/data.exe from this repository. Reserve approximately 19GB for them. This will take several minutes to complete.
    • Each row is one record in the format <string: customer id>;<string: order id>;<double: price in EUR>, with the price value having exactly 2 fractional digits:
    ID672;IQRWG;363.81
    ID016;OEWET;9162.02
    ID002;IOIUD;15017.20
    ..........
    
    • There are 999 different customers
    • Price value: a non-null double between 0.00 (inclusive) and 50000.00 (inclusive), always with 2 fractional digits
    • Read from the CSV files AND continuously send data to the data topic using the script run/producer.sh from this repository
  • Output topic must contain messages with key/value and no additional headers:

    • Key: customer id, example ID672
    • Value: order count | count of orders with price > 40000 | min price | max price, grouped by key; example 1212 | 78 | 4.22 | 48812.22 (see the aggregation sketch after this list)
    • Expected to have 999 different messages
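
💡 To make the input and output formats concrete, here is a minimal Kafka Streams sketch of the aggregation described above. It is not the repository's sample solution: it assumes each message on the data topic carries one CSV row as a plain string value, keeps the running state as a pipe-delimited string to avoid a custom Serde, uses placeholder application id and bootstrap servers, and ignores production concerns such as exactly-once processing.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Locale;
import java.util.Properties;

public class AggregationSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "1brc-streaming-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("data", Consumed.with(Serdes.String(), Serdes.String()))
                // Re-key by customer id (first CSV field); keep only the price field as the value.
                .map((key, csvRow) -> {
                    String[] fields = csvRow.split(";");
                    return KeyValue.pair(fields[0], fields[2]);
                })
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                // Running state per customer: "count|countAbove40000|min|max" kept as a single string.
                .aggregate(
                        () -> "0|0|Infinity|-Infinity",
                        (customerId, priceStr, agg) -> {
                            String[] a = agg.split("\\|");
                            double price = Double.parseDouble(priceStr);
                            long count = Long.parseLong(a[0]) + 1;
                            long above = Long.parseLong(a[1]) + (price > 40000 ? 1 : 0);
                            double min = Math.min(Double.parseDouble(a[2]), price);
                            double max = Math.max(Double.parseDouble(a[3]), price);
                            return count + "|" + above + "|" + min + "|" + max;
                        },
                        Materialized.with(Serdes.String(), Serdes.String()))
                .toStream()
                // Format as "count | countAbove40000 | min | max" for the results topic.
                .mapValues(agg -> {
                    String[] a = agg.split("\\|");
                    return String.format(Locale.ROOT, "%s | %s | %.2f | %.2f",
                            a[0], a[1], Double.parseDouble(a[2]), Double.parseDouble(a[3]));
                })
                .to("results", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }
}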

💡 The Kafka cluster runs cp-kafka, the official Confluent Docker image for Kafka (Community version), version 7.6.0, which ships Apache Kafka 3.6.x

💡 Verify messages published to the data topic with the run/consumer.sh script, which uses https://raw.githubusercontent.com/confluentinc/librdkafka/master/examples/consumer.c. To run the consumer, make sure librdkafka is installed.
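
💡 If librdkafka is not available, a plain Java consumer can serve the same quick check. This is only a convenience sketch (class name and consumer settings are assumptions), not part of the challenge tooling:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class PeekDataTopic {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "peek-data-topic");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("data"));
            // Print a single batch of records and exit.
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}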

How to test the challenge

  1. Run the script run/data.sh or run/windows/data.exe to create 1B rows split across 10 CSV files.
  2. Run the script run/bootstrap.sh to set up a Kafka cluster and the required topics.
  3. Deploy your solution and run it, publishing results to the results topic.
  4. Run the script run/producer.sh in a new terminal. The producer will read from the input files and publish to the data topic.

At the end, clean up with the script run/tear-down.sh

How to participate in the challenge

  1. Fork this repo
  2. Add your solution to a folder named challenge-YOURNAME, for example challenge-hifly
  3. Open a Pull Request detailing your solution with instructions on how to deploy it

✅ Your solution will be tested using the same docker-compose file. Results will be published on this page.

💻 Solutions will be tested on a (TODO) server

💡 A sample Kafka Streams implementation is present in the folder challenge. Test it with:

cd challenge
mvn clean compile && mvn exec:java -Dexec.mainClass="io.hifly.onebrcstreaming.SampleApp"