1brc_streaming

1brc challenge with streaming solutions for Apache Kafka

Inspired by the original 1brc challenge created by Gunnar Morling: https://www.morling.dev/blog/one-billion-row-challenge

⚠️ This is still a WIP project

⚠️ This challenge does not aim to be competitive with the original challenge. It is dedicated to streaming technologies that integrate with Apache Kafka, and results will be evaluated using completely different measures.

Prerequisites

  • Docker Engine and Docker Compose
  • about XXGB of free space
  • the challenge will run only on these supported architectures:
    • Linux - x86_64
    • Darwin (Mac) - x86_64 and arm
    • Windows

Simulation Environment

  • Kafka cluster with 3 brokers. The cluster must be local only. Reserve approximately XXGB for data.
  • Input topic named data with 32 partitions, replication factor 3 and LogAppendTime
  • Output topic named results with 32 partitions and replication factor 3
  • The Kafka cluster must be started with the script run/bootstrap.sh from this repository. The bootstrap script also creates the input and output topics (their expected configuration is sketched after this list).
  • Brokers will listen on ports 9092, 9093 and 9094. No authentication, no SSL.
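
💡 The bootstrap script already creates both topics; the sketch below only illustrates, with Kafka's Java AdminClient, the topic configuration described in the list above. Topic names, partition counts, replication factor and LogAppendTime come from that list; the class name and bootstrap servers string are assumptions for illustration.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateTopicsSketch {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092,localhost:9093,localhost:9094");

        try (AdminClient admin = AdminClient.create(props)) {
            // Input topic: 32 partitions, replication factor 3, broker-side LogAppendTime timestamps.
            NewTopic data = new NewTopic("data", 32, (short) 3)
                    .configs(Map.of("message.timestamp.type", "LogAppendTime"));
            // Output topic: 32 partitions, replication factor 3.
            NewTopic results = new NewTopic("results", 32, (short) 3);

            admin.createTopics(List.of(data, results)).all().get();
        }
    }
}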

Rules

  • Implement a solution with the Kafka APIs, Kafka Streams, Flink, ksqlDB, Spark, NiFi, Camel Kafka, Spring Kafka, etc., reading input data from the data topic and sinking results to the results topic, and run it! This is not limited to Java!

  • Ingest data into a kafka topic:

    • Create 10 CSV files using the script run/data.sh or run/windows/data.exe from this repository. Reserve approximately 19GB for them. This will take several minutes to complete.
    • Each row is one record in the format <string: customer id>;<string: order id>;<double: price in EUR>, with the price value having exactly 2 fractional digits:
    ID672;IQRWG;363.81
    ID016;OEWET;9162.02
    ID002;IOIUD;15017.20
    ..........
    
    • There are 999 different customers
    • Price value: a non-null double between 0.00 (inclusive) and 50000.00 (inclusive), always with 2 fractional digits
    • Read from the CSV files AND continuously send data to the data topic using the script run/producer.sh from this repository
  • Output topic must contain messages with key/value and no additional headers:

    • Key: customer id, example ID672
    • Value: order count | count of orders with price > 40000 | min price | max price, grouped by key; example 1212 | 78 | 4.22 | 48812.22 (see the aggregation sketch after this list)
    • Expected to have 999 different messages
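
💡 To make the input and output formats concrete, here is a minimal Kafka Streams sketch of the aggregation described above. It is not the repository's sample solution: it assumes each message on the data topic carries one CSV row as a plain string value, keeps the running state as a pipe-delimited string to avoid a custom Serde, uses placeholder application id and bootstrap servers, and ignores production concerns such as exactly-once processing.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Locale;
import java.util.Properties;

public class AggregationSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "1brc-streaming-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("data", Consumed.with(Serdes.String(), Serdes.String()))
                // Re-key by customer id (first CSV field); keep only the price field as the value.
                .map((key, csvRow) -> {
                    String[] fields = csvRow.split(";");
                    return KeyValue.pair(fields[0], fields[2]);
                })
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                // Running state per customer: "count|countAbove40000|min|max" kept as a single string.
                .aggregate(
                        () -> "0|0|Infinity|-Infinity",
                        (customerId, priceStr, agg) -> {
                            String[] a = agg.split("\\|");
                            double price = Double.parseDouble(priceStr);
                            long count = Long.parseLong(a[0]) + 1;
                            long above = Long.parseLong(a[1]) + (price > 40000 ? 1 : 0);
                            double min = Math.min(Double.parseDouble(a[2]), price);
                            double max = Math.max(Double.parseDouble(a[3]), price);
                            return count + "|" + above + "|" + min + "|" + max;
                        },
                        Materialized.with(Serdes.String(), Serdes.String()))
                .toStream()
                // Format as "count | countAbove40000 | min | max" for the results topic.
                .mapValues(agg -> {
                    String[] a = agg.split("\\|");
                    return String.format(Locale.ROOT, "%s | %s | %.2f | %.2f",
                            a[0], a[1], Double.parseDouble(a[2]), Double.parseDouble(a[3]));
                })
                .to("results", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }
}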

💡 The Kafka cluster runs cp-kafka, the official Confluent Docker image for Kafka (Community version), version 7.6.0, which ships Apache Kafka 3.6.x

💡 Verify messages published to the data topic with the run/consumer.sh script, which uses https://raw.githubusercontent.com/confluentinc/librdkafka/master/examples/consumer.c. To run the consumer, make sure librdkafka is installed.
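
💡 If librdkafka is not available, a plain Java consumer can serve the same quick check. This is only a convenience sketch (class name and consumer settings are assumptions), not part of the challenge tooling:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class PeekDataTopic {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "peek-data-topic");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("data"));
            // Print a single batch of records and exit.
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}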

How to test the challenge

  1. Run the script run/data.sh or run/windows/data.exe to create 1B rows split across 10 CSV files.
  2. Run the script run/bootstrap.sh to set up a Kafka cluster and the required topics.
  3. Deploy your solution and run it, publishing results to the results topic.
  4. Run the script run/producer.sh in a new terminal. The producer will read from the input files and publish to the data topic.

At the end, clean up with the script run/tear-down.sh

How to participate in the challenge

  1. Fork this repo
  2. Add your solution to a folder named challenge-YOURNAME, for example challenge-hifly
  3. Open a Pull Request detailing your solution with instructions on how to deploy it

✅ Your solution will be tested using the same docker-compose file. Results will be published on this page.

💻 Solutions will be tested on a (TODO) server

💡 A sample Kafka Streams implementation is present in the folder challenge. Test it with:

cd challenge
mvn clean compile && mvn exec:java -Dexec.mainClass="io.hifly.onebrcstreaming.SampleApp"