Kafka Stream Processing with Bytewax

Processing ride events from Kafka with Bytewax


    This repo contains a basic event-processing example. The idea was to experiment with Bytewax for processing events from Kafka. Bytewax is a lightweight tool for real-time pipelines that is easy to install and easy to deploy.


Contents

  1. About The Project
  2. Getting Started
  3. License
  4. Acknowledgments

About The Project

    Building real-time data pipelines is very challenging, since we have to deal with changing data volumes, latency, data quality, many data sources, and so on. Another challenge we usually face is related to the processing tools: some of them can't process event-by-event, some are too expensive in high-throughput scenarios, and some are really hard to deploy. So, in order to make better decisions about our data stack, we have to properly understand the main options available for storing and processing events.

    With that in mind, the purpose of this repo is to get in touch with some of Bytewax's stream processing capabilities. To achieve this goal, we ran a Kafka cluster in Docker and produced fake ride data (something like an Uber ride) to a topic. A Bytewax dataflow then consumed those records, performed some transformations, and saved the refined data into another topic.

(back to top)

Business Context

    In order to be more business-like, we created a fake business context.

    KAB Rideshare is a company that provides ride-hailing services. Due to a recent increase in the number of users, KAB Rideshare has to improve its data architecture and start processing ride events for analytical purposes. So, our Data Team was given the task of implementing an event processing step. The company already stores all the events in a Kafka cluster, so our task is just to choose and implement the processing tool.
    After some brainstorming, we decided to start with an MVP using a lightweight but very powerful tool: Bytewax. This way, we can get results quickly so the analytical team can do their job, and with the time gained we can better evaluate other solutions.

(back to top)

Development Plan

  • Implement the Data Gen
  • Deploy a Kafka cluster on Docker for testing purposes
  • Implement the Bytewax Dataflow

(back to top)


Solution Description

    We'll briefly describe the workflow so you can better understand it.

    First of all, we needed some data, so we developed a simple data generator in Python that loads data into a Kafka topic. This generator was based on a more complex one we developed in the Engenharia de Dados Academy. The data contains fields for source, destination, ride price, user id, and so on.
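    Just to illustrate the idea (the actual generator in the repo may differ), here is a minimal sketch of such a producer using confluent-kafka, with hypothetical field names and value ranges:

    import json
    import random
    import time
    import uuid

    from confluent_kafka import Producer

    def fake_ride():
        # Hypothetical ride event; the real generator's schema may differ.
        return {
            "ride_id": str(uuid.uuid4()),
            "user_id": random.randint(1, 1000),
            "source": random.choice(["Downtown", "Airport", "Mall"]),
            "destination": random.choice(["Downtown", "Airport", "Mall"]),
            "distance_miles": round(random.uniform(0.5, 30.0), 2),
            "price_usd": round(random.uniform(5.0, 80.0), 2),
            "timestamp": time.time(),
        }

    producer = Producer({"bootstrap.servers": "localhost:9092"})

    while True:
        event = fake_ride()
        producer.produce(
            "input-raw-rides",
            key=event["ride_id"],
            value=json.dumps(event),
        )
        producer.poll(0)  # serve delivery callbacks
        time.sleep(0.01)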

    The next step was resource provisioning: using Docker Compose, we ran a Kafka cluster and Apache Pinot locally. Finally, we developed the Bytewax dataflow (a sketch of it follows the list below). It performs the following computations:

  1. miles to kilometers conversion
  2. USD to BRL conversion
  3. add processing time
  4. add event time (convert timestamp to datetime string)
  5. add dynamic fare flag
  6. filter desired columns
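    A minimal sketch of what such a dataflow can look like, assuming a Bytewax 0.16-style API (KafkaInput/KafkaOutput connectors) and hypothetical field names and conversion rates; the actual implementation lives in this repo:

    import json
    from datetime import datetime, timezone

    from bytewax.connectors.kafka import KafkaInput, KafkaOutput
    from bytewax.dataflow import Dataflow

    BROKERS = ["localhost:9092"]

    def enrich(key_payload):
        """Deserialize a ride event, convert units, and add derived fields."""
        key, payload = key_payload
        event = json.loads(payload)

        # 1-2. unit conversions (1 mile = 1.60934 km; the BRL rate is a placeholder)
        event["distance_km"] = round(event["distance_miles"] * 1.60934, 2)
        event["price_brl"] = round(event["price_usd"] * 5.0, 2)

        # 3-4. processing time and a human-readable event time
        event["processing_time"] = datetime.now(timezone.utc).isoformat()
        event["event_time"] = datetime.fromtimestamp(
            event["timestamp"], tz=timezone.utc
        ).strftime("%Y-%m-%d %H:%M:%S")

        # 5. dynamic fare flag (placeholder rule)
        event["dynamic_fare"] = event["price_usd"] > 50

        # 6. keep only the desired columns
        wanted = [
            "user_id", "source", "destination", "distance_km",
            "price_brl", "event_time", "processing_time", "dynamic_fare",
        ]
        refined = {field: event[field] for field in wanted}

        return key, json.dumps(refined).encode("utf-8")

    flow = Dataflow()
    flow.input("in", KafkaInput(brokers=BROKERS, topics=["input-raw-rides"]))
    flow.map(enrich)
    flow.output("out", KafkaOutput(brokers=BROKERS, topic="output-bytewax-enriched-rides"))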

    As an additional step, we ran queries on Pinot to check the processed events from Bytewax.

(back to top)

Results Found

    Bytewax seems very interesting. With a single worker running in pure Python, it was able to process an average of 500 events per second, as the image below shows:

    Although we implemented a very simple example here, Bytewax's processing capacity is actually quite interesting.

(back to top)

Next Steps

From now on, we'll deploy Kafka, Bytewax, and the data generator to a Kubernetes cluster so we can better explore real-world scenarios and Bytewax's scaling capabilities.

(back to top)

Getting Started

    You'll find in this section all the steps needed to reproduce this solution.

Quick tip

    I'm using Gitpod to work on this project. Gitpod provides an initialized workspace integrated with a code repository, such as GitHub. In short, it is basically VS Code accessed through the browser, with Python, Docker, and some other tools already installed, so we don't need to install anything on our machine. They offer a free plan with up to 50 hours/month (no credit card required), which is more than enough for practical projects (at least for me).

(back to top)

Prerequisites and Installations

For Python dependencies, please refer to `requirements.txt`; it lists everything we needed.

(back to top)

Reproducing

  1. Clone the repo:

    git clone https://github.com/KattsonBastos/kafka-bytewax-streaming.git
  2. Create a Python virtual environment for Bytewax (I'm using virtualenv for simplicity) and activate it:

    python3 -m virtualenv env
    source env/bin/activate
  3. Install Python's packages with the following command:

    pip install -r requirements.txt
  4. Provision the resources (Kafka and Pinot):

    cd src/build/
    docker compose up -d
  5. Create Kafka Topics:

    We'll need two topics: `input-raw-rides`, for events from the data generator, and `output-bytewax-enriched-rides`, for the refined events. First, we need to access the broker container:

    docker exec -it broker bash
    

    Then, create both topics:

    kafka-topics --create --topic=input-raw-rides --bootstrap-server=localhost:9092 --partitions=3 --replication-factor=1
    
    kafka-topics --create --topic=output-bytewax-enriched-rides --bootstrap-server=localhost:9092 --partitions=3  --replication-factor=1
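
    Optionally, you can confirm both topics exist from Python. A quick check using confluent-kafka's AdminClient, assuming the broker is reachable at localhost:9092:

    from confluent_kafka.admin import AdminClient

    # List topic names known to the cluster; both new topics should appear.
    admin = AdminClient({"bootstrap.servers": "localhost:9092"})
    metadata = admin.list_topics(timeout=10)
    print(sorted(metadata.topics.keys()))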
    
    
  6. Start the data generator:

    cd src/data-gen-topic/
    
    bash run.sh input-raw-rides

    Remember that `input-raw-rides` is the topic we want to save the events into.

  7. Start the Bytewax dataflow (remember to activate the env created in the second step; here `dataflow` refers to the Python module that defines the flow):

    python3 -m bytewax.run dataflow

    If you want more workers per process, pass the `-w` flag (refer to the Bytewax docs for more options):

    python3 -m bytewax.run dataflow -w 2

    Here you can play with Bytewax and check its performance.

  8. Check the refined events. Here you can do one or more of the following (it's up to you):

    • go to the Kafka Control Center (port 9021) and take a look at the topic;

    • access the container and run the following command:

      kafka-console-consumer --bootstrap-server=localhost:9092 --topic=output-bytewax-enriched-rides
    • Or you can create a table in Apache Pinot and play with the data. To do so, just run the script `src/pinot/run.sh` and then check the table in the Pinot Controller (port 9000). If you prefer, you can also run the Python script `src/pinot/query.py` (a sketch of such a script follows). Make sure to replace the query with your own.
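
      For reference, a minimal sketch of what a query script like `src/pinot/query.py` might look like, assuming the `pinotdb` client, a Pinot broker on its default port 8099, and a hypothetical table name `rides`:

      from pinotdb import connect

      # Connect to the Pinot broker's SQL endpoint.
      conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
      cursor = conn.cursor()

      # Hypothetical table and query; replace with your own.
      cursor.execute("SELECT * FROM rides LIMIT 10")
      for row in cursor:
          print(row)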

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Acknowledgments

(back to top)