This project demonstrates a distributed data processing solution using Apache Kafka and its Kafka Streams API. The primary goal is to perform a GroupBy operation on data from a CSV file with two columns.
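To make the goal concrete, here is a hypothetical two-column input and the grouped result it could produce, assuming the aggregation counts the rows belonging to each distinct value of the first column (the actual aggregation performed by the project may differ):

```
# example.csv (hypothetical)
city,amount
Berlin,10
Paris,5
Berlin,7

# GroupBy result (count per key)
Berlin -> 2
Paris  -> 1
```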
- Distributed data processing
- Utilization of Kafka Streams for data aggregation
- Modular architecture for enhanced scalability
- Spring Boot
- Apache Kafka
- Kafka Streams API
- Docker (for Kafka server)
The program is divided into the following modules:
- `kafka-distributer`: serves as the entry point of the project. Through the command line, we specify the file for which we want to perform the GroupBy operation. For splitting, we utilize a separate splitter, which can be found in another repository at link-to-splitter-repository.
- `kafka-distributor-split-to-stream`: responsible for the distributed reading of a file. It receives messages of the form `{leftIndex: <left_index>, rightIndex: <right_index>, path: <path_to_file>}` as input and produces messages to the `stream-input` topic (a hypothetical sketch of this message shape follows the module list).
- `kafka-distributor-stream-to-group-by`: produces the resulting GroupBy using Kafka Streams. It reads data from the `stream-input` topic and aggregates the result into the `stream-output` topic; the `kafka-distributer` client module then receives the final result from `stream-output` (a minimal topology sketch appears after the module descriptions).
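As an illustration, the split message described above could be modeled as a small Java record and parsed with Jackson. Everything below (class name, field types, the assumption that the payload is JSON) is a sketch rather than the project's actual implementation; only the field names come from this README.

```java
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical model of the message consumed by kafka-distributor-split-to-stream:
// the slice [leftIndex, rightIndex] of the file at `path` that one instance should read.
public record SplitMessage(long leftIndex, long rightIndex, String path) {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Parses a JSON payload such as:
    // {"leftIndex": 0, "rightIndex": 1024, "path": "/data/example.csv"}
    public static SplitMessage fromJson(String json) throws Exception {
        return MAPPER.readValue(json, SplitMessage.class);
    }
}
```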
The `kafka-distributor-split-to-stream` and `kafka-distributor-stream-to-group-by` modules are designed for extensibility in order to provide a distributed processing system. This program serves as an example solution that demonstrates distributed processing capabilities.
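A minimal Kafka Streams topology for the GroupBy stage could look like the sketch below. Only the topic names `stream-input` and `stream-output` come from this README; the string serdes, the count aggregation, the application id, and the broker address are assumptions made for illustration.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class GroupByTopologySketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka-distributor-group-by"); // assumed id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");          // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Read records from stream-input, group them by key, count per key,
        // and write the running counts to stream-output.
        KStream<String, String> input = builder.stream("stream-input");
        input.groupByKey()
             .count()
             .toStream()
             .mapValues(count -> Long.toString(count))
             .to("stream-output", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

In the actual module, the grouping key and the aggregation would match whatever the two CSV columns represent.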
- Writing tests for modules
- Clone this repository.
- Set up Docker to run the Kafka server using the provided `docker-compose.yml` file.
- Start the Kafka server using Docker.
- Run the `kafka-distributor-split-to-stream` module.
- Execute the `kafka-distributor-stream-to-group-by` module.
- Launch the `kafka-distributer` module, specifying the path to the CSV file or using the provided example in the `resources` folder (an example command sequence is shown below).
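For reference, one possible command sequence is shown below. It assumes a Maven multi-module build and that each module is started in its own terminal; adjust the commands to the project's actual build tool, module names, and argument handling.

```bash
# Start the Kafka broker described by docker-compose.yml (detached)
docker compose up -d

# In separate terminals, start each processing module
# (Maven multi-module build assumed)
./mvnw spring-boot:run -pl kafka-distributor-split-to-stream
./mvnw spring-boot:run -pl kafka-distributor-stream-to-group-by

# Finally, launch the client module with the path to the CSV file
./mvnw spring-boot:run -pl kafka-distributer -Dspring-boot.run.arguments="path/to/data.csv"
```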