kafka-stream-pipelines

In traditional streaming applications the data flow needs to be known either at build time or at startup time. With this approach, the wiring is done completely at runtime.


Kafka Streaming Pipeline

The idea is to enable stream processing while all the transformations can be applied, changed or removed at runtime. Every transformation is just a REST-like service. This way one can develop and maintain not only the whole data pipeline but also every transformation logic with zero downtime.

Bolting Pipelets

Basically we have a Kafka cluster where we subscribe to a specific topic, push each event forward to a REST API and finally send the REST result to a target topic. Such stateless functions are also called serverless functions or, thanks to Amazon, lambdas.

Now imagine a sequence of such lambdas - this is what we call a pipeline. A lambda plus the definition of the pipeline and the topics is what we call a pipelet. To attach one pipelet to another, we bolt them together.
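To make the terminology concrete, here is a minimal sketch of what a pipelet definition might look like. The class and field names are illustrative assumptions, not the actual classes in this repository:

```java
// Hypothetical sketch of a pipelet: a lambda-style REST service plus the
// pipeline and topic definitions that wire it into the data flow.
public class Pipelet {
    final String pipeline;     // pipeline this pipelet belongs to
    final String service;      // logical name of the backing REST service
    final String sourceTopic;  // topic the pipelet consumes from
    final String targetTopic;  // topic the REST result is sent to
    final String serviceUrl;   // URL template of the lambda-style service

    public Pipelet(String pipeline, String service,
                   String sourceTopic, String targetTopic, String serviceUrl) {
        this.pipeline = pipeline;
        this.service = service;
        this.sourceTopic = sourceTopic;
        this.targetTopic = targetTopic;
        this.serviceUrl = serviceUrl;
    }

    /** Bolting: this pipelet consumes the upstream pipelet's target topic. */
    public boolean boltsOnto(Pipelet upstream) {
        return sourceTopic.equals(upstream.targetTopic);
    }
}
```

Two pipelets are bolted together simply by letting the second one read its source topic from the first one's target topic.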

|===||===||===||===||===||===||===||===||===||===||===||===||===|

Demo

  • compile demo ./gradlew clean :demo:build
  • run the demo java -jar demo/build/libs/demo-1.0-SNAPSHOT.jar
    • this starts an embedded kafka cluster
    • and the bolting server
  • browse to http://localhost:8080/
  • add the "demo.returns" topic as a "bar" chart
  • open up a new console and run ./gradlew :demo:startSource
  • you should see the returns chart updating on the ui
  • add the topic "demo.performance" as a "line" chart
  • run curl to bolt the demo service to the source
curl -H "Content-Type: text/plain" \
-X POST 'http://localhost:8080/api/v1/bolt/demo-pipeline/demo-service-1/demo.returns/demo.performance/GET' \
-d 'http://localhost:4567/demo/fold?key=${event.key}&value=${event.value}&state=${state}&offset=${state.nextConsumerOffset()}'
  • run ./gradlew :demo:startService in yet another new terminal window
  • the performance chart should update in sync with the returns chart
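The bolt URL above contains placeholders such as `${event.key}` and `${event.value}` that are filled in per event. As a rough illustration (the real server's templating may well differ), such an expansion could be sketched like this:

```java
import java.util.Map;

// Illustrative sketch only: expanding ${...} placeholders in a bolt URL
// template with per-event values. Placeholder names mirror the demo curl
// command; the actual templating engine is an assumption.
public class UrlTemplate {
    public static String expand(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> e : values.entrySet()) {
            result = result.replace("${" + e.getKey() + "}", e.getValue());
        }
        return result;
    }
}
```

For example, expanding `http://localhost:4567/demo/fold?key=${event.key}&value=${event.value}` with `event.key=k1` and `event.value=0.02` yields `http://localhost:4567/demo/fold?key=k1&value=0.02`.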

Now we can do things like:

  • kill and restart the demo service without data loss
  • persist the pipeline and also allow restarting the bolting service (coming up next)
  • replace a service by a different one (coming up next)

Development

  • For the react frontend we have a webpack development server: ./gradlew :demo:frontend:start --no-daemon
  • We have embedded a Kafka server into the demo module. During development, boot-up time with the full embedded Kafka is too slow, so we can start a standalone Kafka server instead: ./gradlew :embedded-kafka:start
  • A demo source can be started with: ./gradlew :demo:startSource
  • A demo service can be started with: ./gradlew :demo:startService

TODO

  • persist pipelets and state and test server restart.
  • sources: just like pipelets we also want to discover sources available for the bolting machine. All sources should also be backed up by some database so that we can replay whole pipelines if necessary.
  • versioning of topics: if we replace a pipelet we want to re-run all the topics depending on the replaced target topic. To do so we need some kind of versioned topics.
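One possible shape for such versioned topics (an assumption, since this is not implemented yet) is a plain version suffix on the topic name that gets bumped whenever a pipelet is replaced, so dependent topics can be re-run against the new version:

```java
// Sketch of a hypothetical versioned-topic naming scheme: "demo.performance"
// at version 2 becomes "demo.performance.v2". Not part of the current code.
public class VersionedTopic {
    public static String name(String topic, int version) {
        return topic + ".v" + version;
    }

    public static int version(String versionedName) {
        int i = versionedName.lastIndexOf(".v");
        return Integer.parseInt(versionedName.substring(i + 2));
    }
}
```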

Ideas