This project demonstrates a distributed data processing solution using Apache Kafka and its Kafka Streams API. The primary goal is to perform a GroupBy operation on data from a CSV file with two columns.
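To make the goal concrete, here is a hypothetical two-column input and the grouped result it could produce, assuming the aggregation counts the rows belonging to each distinct value of the first column (the actual aggregation performed by the project may differ):

```
# example.csv (hypothetical)
city,amount
Berlin,10
Paris,5
Berlin,7

# GroupBy result (count per key)
Berlin -> 2
Paris  -> 1
```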
- Distributed data processing
- Utilization of Kafka Streams for data aggregation
- Modular architecture for enhanced scalability
- Spring Boot
- Apache Kafka
- Kafka Streams API
- Docker (for Kafka server)
The program is divided into the following modules:
- `kafka-distributer`: serves as the entry point of the project. Through the command line, we specify the file for which we want to perform the GroupBy operation. For splitting, we utilize a separate splitter, which can be found in another repository at link-to-splitter-repository.
- `kafka-distributor-split-to-stream`: responsible for the distributed reading of a file. It receives messages of the form `{leftIndex: <left_index>, rightIndex: <right_index>, path: <path_to_file>}` as input and produces messages to the `stream-input` topic (a hypothetical sketch of this message shape follows the module list).
- `kafka-distributor-stream-to-group-by`: produces the resulting GroupBy using Kafka Streams. It reads data from the `stream-input` topic and aggregates the result into the `stream-output` topic; the `kafka-distributer` client module then receives the final result from `stream-output` (a minimal topology sketch appears after the module descriptions).
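As an illustration, the split message described above could be modeled as a small Java record and parsed with Jackson. Everything below (class name, field types, the assumption that the payload is JSON) is a sketch rather than the project's actual implementation; only the field names come from this README.

```java
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical model of the message consumed by kafka-distributor-split-to-stream:
// the slice [leftIndex, rightIndex] of the file at `path` that one instance should read.
public record SplitMessage(long leftIndex, long rightIndex, String path) {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Parses a JSON payload such as:
    // {"leftIndex": 0, "rightIndex": 1024, "path": "/data/example.csv"}
    public static SplitMessage fromJson(String json) throws Exception {
        return MAPPER.readValue(json, SplitMessage.class);
    }
}
```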
The `kafka-distributor-split-to-stream` and `kafka-distributor-stream-to-group-by` modules are designed for extensibility in order to provide a distributed processing system. This program serves as an example solution that demonstrates distributed processing capabilities.
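A minimal Kafka Streams topology for the GroupBy stage could look like the sketch below. Only the topic names `stream-input` and `stream-output` come from this README; the string serdes, the count aggregation, the application id, and the broker address are assumptions made for illustration.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class GroupByTopologySketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka-distributor-group-by"); // assumed id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");          // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Read records from stream-input, group them by key, count per key,
        // and write the running counts to stream-output.
        KStream<String, String> input = builder.stream("stream-input");
        input.groupByKey()
             .count()
             .toStream()
             .mapValues(count -> Long.toString(count))
             .to("stream-output", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

In the actual module, the grouping key and the aggregation would match whatever the two CSV columns represent.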
- Writing tests for modules
- Clone this repository.
- Set up Docker to run the Kafka server using the provided `docker-compose.yml` file.
- Start the Kafka server using Docker.
- Run the `kafka-distributor-split-to-stream` module.
- Execute the `kafka-distributor-stream-to-group-by` module.
- Launch the `kafka-distributer` module, specifying the path to the CSV file or using the provided example in the `resources` folder (an example command sequence is shown below).
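For reference, one possible command sequence is shown below. It assumes a Maven multi-module build and that each module is started in its own terminal; adjust the commands to the project's actual build tool, module names, and argument handling.

```bash
# Start the Kafka broker described by docker-compose.yml (detached)
docker compose up -d

# In separate terminals, start each processing module
# (Maven multi-module build assumed)
./mvnw spring-boot:run -pl kafka-distributor-split-to-stream
./mvnw spring-boot:run -pl kafka-distributor-stream-to-group-by

# Finally, launch the client module with the path to the CSV file
./mvnw spring-boot:run -pl kafka-distributer -Dspring-boot.run.arguments="path/to/data.csv"
```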