Comparison with Kafka
Anmol-Singh-Jaggi opened this issue · 5 comments
Possibly a stupid question, but why not just use Kafka?
In Kafka too, we can have a single producer producing data and multiple consumers consuming that data.
The only difference I can see is that in Kafka, the data passes through a rather elaborate broker system: Producer -> Broker -> Consumer.
Whereas in Hollow, it's directly Producer -> Consumer.
I used Hollow in the past, and IMO there are several differences:
- de-duplication: Hollow will de-dup and reuse records with the same values to reduce the memory footprint. For example, if you have an enum-like field, Hollow will store the string in memory once and reuse it. You wouldn't get that with a Kafka consumer, since each message is unique and the field is just one string in the body, which you also have to parse; you might end up using more memory. (See the producer sketch after this list.)
- Hollow produces snapshots and deltas, which can be used as versioning of a dataset's state. If something goes wrong, like bad data being written, consumers can easily pin to a specific point in time (see the pinning sketch after this list). With Kafka consumers, achieving this is not possible, at least not out of the box.
- with Hollow, consumers have the whole dataset in memory. If you have a `Car` type, for example, with 1k records, all the consumers will load it into memory. In Kafka, consumers read from a set of partitions, which only lets them keep data from the partitions they consume, and that changes when rebalancing kicks in because an instance was added to or removed from the group. In both cases, each consumer ends up with different records. With this approach, partitions are also your scaling limit. Nowadays you can use Kafka Streams to store all the data in a `GlobalKTable`, or use a `KTable` and introduce gRPC to query the data between instances (see the `GlobalKTable` sketch after this list). Again, you don't get de-dup here. Another option in Kafka, I guess, would be to start each consumer with a different consumer group id so it consumes all partitions; if your topic is big enough, it might take some time to ingest all the data into a single instance.
- You can query data in Hollow via primary keys, compound keys, hash indices, and prefix indices: https://hollow.how/indexing-querying/ (see the index sketch after this list). With Kafka, out of the box you would only get message-key lookup. You can get around that by using Kafka Streams to transform messages as they come in, creating keys for searchability, and storing them in a `KTable`, or by using KSQL. In both cases, you need to implement it to fit your needs.
- multi-region: pushing snapshots/deltas to a blob store in a multi-region environment is definitely easier than multi-region Kafka. If you can open traffic between regions for your Kafka cluster, then you might be fine. If not, you get into replication challenges; MirrorMaker wasn't great for this. I'm sure by now Confluent has good tooling for it.
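
To make the de-dup point concrete, here is a minimal Hollow producer sketch. The `Car` type, its fields, and the filesystem paths are hypothetical, and the filesystem publisher/announcer are just the simplest way to demo locally:

```java
import com.netflix.hollow.api.producer.HollowProducer;
import com.netflix.hollow.api.producer.fs.HollowFilesystemAnnouncer;
import com.netflix.hollow.api.producer.fs.HollowFilesystemPublisher;

import java.nio.file.Path;
import java.nio.file.Paths;

public class CarProducer {

    // Hypothetical record type. Many cars share the same "make" value;
    // Hollow stores each distinct String once and reuses it, so a million
    // cars with twenty makes hold only twenty String records in memory.
    static class Car {
        int id;
        String make;
        Car(int id, String make) { this.id = id; this.make = make; }
    }

    public static void main(String[] args) {
        Path publishDir = Paths.get("/tmp/hollow/cars"); // hypothetical path
        HollowProducer producer = HollowProducer
                .withPublisher(new HollowFilesystemPublisher(publishDir))
                .withAnnouncer(new HollowFilesystemAnnouncer(publishDir))
                .build();

        // Each cycle publishes a snapshot or delta of the full dataset.
        producer.runCycle(state -> {
            state.add(new Car(1, "Toyota"));
            state.add(new Car(2, "Toyota")); // "Toyota" is de-duped, stored once
        });
    }
}
```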
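
For the pinning point, here's a consumer-side sketch. Normally the consumer follows the announcer to the latest version, but `triggerRefreshTo` lets you jump to an exact state; the directory and version number below are made up:

```java
import com.netflix.hollow.api.consumer.HollowConsumer;
import com.netflix.hollow.api.consumer.fs.HollowFilesystemBlobRetriever;

import java.nio.file.Paths;

public class PinnedCarConsumer {
    public static void main(String[] args) {
        HollowConsumer consumer = HollowConsumer
                .withBlobRetriever(new HollowFilesystemBlobRetriever(Paths.get("/tmp/hollow/cars")))
                .build();

        // Bad data was published? Pin the consumer back to the last
        // known-good version; Hollow walks snapshots/deltas to reach it.
        consumer.triggerRefreshTo(20240101123456L); // hypothetical version id
    }
}
```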
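
For comparison, this is roughly what the `GlobalKTable` route looks like in Kafka Streams (topic name, application id, and String serdes are placeholders). Every instance materializes the full topic locally, but each record is stored as-is, with no de-dup:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.GlobalKTable;

import java.util.Properties;

public class CarGlobalTable {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cars-app");          // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Unlike a regular KTable, a GlobalKTable is populated from ALL
        // partitions on every instance, so each instance holds the full
        // dataset locally, but records are stored verbatim (no de-dup).
        GlobalKTable<String, String> cars = builder.globalTable("cars");

        new KafkaStreams(builder.build(), props).start();
    }
}
```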
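
And a sketch of the primary-key lookup from the indexing/querying page linked above, reusing the hypothetical `Car` type with an int `id` field:

```java
import com.netflix.hollow.api.consumer.HollowConsumer;
import com.netflix.hollow.core.index.HollowPrimaryKeyIndex;

public class CarLookup {
    private final HollowPrimaryKeyIndex idx;

    CarLookup(HollowConsumer consumer) {
        // "Car" and "id" are the hypothetical type/field names from above.
        idx = new HollowPrimaryKeyIndex(consumer.getStateEngine(), "Car", "id");
        idx.listenToDataRefresh(); // keep the index valid across delta updates
    }

    int ordinalForCar(int carId) {
        return idx.getMatchingOrdinal(carId); // -1 if no matching record
    }
}
```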
@Sunjeet @toolbear @akhaku @dkoszewnik
Hey folks, do you have anything you could add around this? Also, this is from my experience ~2 years ago.
I think you pretty much covered it! We get the same question internally sometimes (what's the difference between Kafka and Hollow/Gutenberg), and my answer is that yes, you can technically implement it using Kafka as an underlying mechanism, but then you're re-implementing the features that Hollow provides.
Hi @Anmol-Singh-Jaggi, let me know if what I shared makes sense. Always happy to chat about this. Otherwise, feel free to close the issue 😉
Thanks for so much info! :)