spark-streaming-direct-kafka
High Performance Spark Streaming with Direct Kafka in Java
What & Why
This simple library provides an easy way to consume from Kafka using Spark Streaming. It keeps offsets in ZooKeeper instead of storing them in HDFS. Because the library stores offsets only once per batch, it can achieve very high throughput.
This approach is relatively reliable, but some data loss is still possible. In most scenarios it provides at-least-once guarantees. We managed to consume over 100,000 messages/sec using this library.
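As a rough illustration of the once-per-batch offset idea, here is a minimal, dependency-free sketch. It is not the library's code: the class `OffsetPerBatchSketch`, the method `runBatch`, and the in-memory map standing in for ZooKeeper are all hypothetical. It only shows why committing the offset after a batch is processed tends to give at-least-once behaviour: a failure before the commit causes the same messages to be read again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OffsetPerBatchSketch {
    // Hypothetical stand-in for ZooKeeper: partition -> next offset to read.
    static final Map<Integer, Long> offsetStore = new HashMap<>();

    // Reads one batch from a partition's log, "processes" it, and stores the
    // offset once at the end. If failBeforeCommit is true, the offset is not
    // stored, which simulates a crash between processing and commit.
    static List<String> runBatch(int partition, List<String> log,
                                 int batchSize, boolean failBeforeCommit) {
        long from = offsetStore.getOrDefault(partition, 0L);
        long until = Math.min(from + batchSize, log.size());
        List<String> processed =
                new ArrayList<>(log.subList((int) from, (int) until));
        if (!failBeforeCommit) {
            // Single offset write per batch, rather than one per message.
            offsetStore.put(partition, until);
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("m0", "m1", "m2", "m3", "m4");
        System.out.println(runBatch(0, log, 2, false)); // batch committed
        System.out.println(runBatch(0, log, 2, true));  // crash before commit
        System.out.println(runBatch(0, log, 2, false)); // same batch replayed
        System.out.println(offsetStore.get(0));
    }
}
```

In this sketch the second and third calls return the same two messages, so downstream processing would need to tolerate duplicates.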
How to Run:
Start your job from this class:
spark-streaming-direct-kafka/src/main/java/com/spark/streaming/tools/StreamingEngine.java
The configs are self-explanatory and can be changed here:
spark-streaming-direct-kafka/src/main/resources/streaming.yml