Tool to read records from Kafka and write them to Parquet files

kafka-to-parquet

This tool copies all the records from a Kafka topic into Parquet files.

Just a pet project to play with Rust :)

How does it work?

It spawns one thread per partition (using Tokio). Each thread consumes all the records of its partition in the specified topic, from the low watermark to the high watermark.
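The fan-out described above can be sketched roughly as follows. This is not the project's actual code: to stay self-contained it uses `std::thread` instead of Tokio and a hypothetical in-memory `Partition` struct instead of a real Kafka consumer. The names `Partition`, `consume_partition` and `consume_all` are illustrative assumptions. It also assumes the usual Kafka convention that the high watermark is the offset of the next record to be written, so the range is exclusive at the top.

```rust
use std::thread;

/// Hypothetical stand-in for a Kafka partition and its watermarks.
struct Partition {
    id: i32,
    /// Lowest available offset (inclusive).
    low: i64,
    /// High watermark: offset of the next record to be written (exclusive).
    high: i64,
}

/// Consume every offset in [low, high) for a single partition.
/// A real consumer would fetch records here; this sketch only returns the offsets.
fn consume_partition(p: &Partition) -> Vec<i64> {
    (p.low..p.high).collect()
}

/// Spawn one worker per partition and wait for all of them to finish.
/// Results come back in the same order as the input partitions.
fn consume_all(partitions: Vec<Partition>) -> Vec<(i32, Vec<i64>)> {
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|p| thread::spawn(move || (p.id, consume_partition(&p))))
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

Calling `consume_all` with one `Partition` per topic partition returns the offsets each worker read. An empty partition (where `low == high`) simply yields no records.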

Each thread writes its records in batches of N, one batch per Parquet file, adding an incremental suffix to each filename.
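A minimal sketch of the batching step, assuming a hypothetical naming scheme of `<prefix>-p<partition>-<index>.parquet`. The real tool's filename format may differ, and `batch_file_name` and `plan_batches` are illustrative names. The sketch only plans which records go into which file and leaves the actual Parquet writing out.

```rust
/// Build the output file name for one batch (hypothetical naming scheme).
fn batch_file_name(prefix: &str, partition: i32, batch_index: usize) -> String {
    // Zero-pad the suffix so the files sort in write order.
    format!("{prefix}-p{partition}-{batch_index:05}.parquet")
}

/// Split `records` into batches of at most `batch_size` records and pair
/// each batch with the file it would be written to.
fn plan_batches<T: Clone>(
    prefix: &str,
    partition: i32,
    records: &[T],
    batch_size: usize,
) -> Vec<(String, Vec<T>)> {
    assert!(batch_size > 0, "batch_size must be positive");
    records
        .chunks(batch_size)
        .enumerate()
        .map(|(i, chunk)| (batch_file_name(prefix, partition, i), chunk.to_vec()))
        .collect()
}
```

With five records and a batch size of 2, `plan_batches` yields three files, the last holding the single leftover record.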

To-do

Some features I'd like to add eventually:

  • Decide on a proper name.
  • Retrieve the min and max offsets of the topic (per partition) to set the consumer boundaries.
  • Multi-threaded Kafka consumption for multi-partition topics.
  • Avro deserializer using Schema Registry.