Documentation on PyKafka vs kafka-python
microamp opened this issue · 17 comments
Hello. I'd like to play around with Kafka, but I don't know which client to use to start with. I know there is at least one other Python client called kafka-python. I wonder if there is any documentation on comparison between the two. I'll start with PyKafka in the meantime. :)
@microamp Thanks, this is a great idea. There's currently no documentation on this, but to my knowledge the main differences are the specifics of the Python API and PyKafka's implementation of the BalancedConsumer. PyKafka strives to keep the API as pythonic as possible, which means using useful features of the language where appropriate for client code simplicity. This includes things like context managers for object cleanup and futures for asynchronous error handling. PyKafka's balanced consumer implements the Kafka project's notion of the "high level consumer", which uses ZooKeeper to balance consumption of partitions between multiple nodes in a consumer group. From what I understand, kafka-python is waiting until Kafka 0.9, when this functionality will be supported natively by the Kafka server itself, to implement self-balancing consumers.
Also, the last time we did a speed test (which was admittedly a while ago at this point), PyKafka's consumer outperformed kafka-python. I unfortunately no longer have the results from that test, so you may not want to bet too hard on PyKafka being significantly faster or slower - just figured I'd mention it.
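To make the balancing idea above concrete, here is a minimal sketch of splitting a topic's partitions across the members of a consumer group. It is only an illustration under stated assumptions: `assign_partitions` is a hypothetical helper, the contiguous-range strategy is a simplification, and it is not necessarily the exact algorithm BalancedConsumer or the Kafka high level consumer uses. Real balancing also involves ZooKeeper coordination, which is left out here.

```python
def assign_partitions(partitions, consumers):
    """Split partitions into contiguous ranges, one range per consumer.

    Hypothetical helper for illustration only; not PyKafka's API.
    """
    partitions = sorted(partitions)
    consumers = sorted(consumers)
    if not consumers:
        return {}

    # Every consumer gets `base` partitions; the first `extra` get one more.
    base, extra = divmod(len(partitions), len(consumers))

    assignment = {}
    start = 0
    for index, consumer in enumerate(consumers):
        count = base + (1 if index < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment


if __name__ == "__main__":
    # Five partitions shared by two consumers in the same group.
    print(assign_partitions(range(5), ["consumer-a", "consumer-b"]))
```

Because both inputs are sorted first, every member of the group can run the same function independently and arrive at the same assignment, which is the property a self-balancing group needs.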
Some more research - there are differences in the versions of Python supported by each library. PyKafka supports 2.7, 3.4, 3.5, and PyPy, while kafka-python adds 2.6 and removes 3.5 support. kafka-python also requires a ZooKeeper connection for offset management, which PyKafka does not. kafka-python supports versions of Kafka from 0.8.0 to 0.8.2, whereas PyKafka only supports 0.8.2.
@emmett9001
Thanks a lot for the reply. I find the information very helpful.
It's good to know that PyKafka supports Python 3.4+. It was still work in progress the last time I checked a few months back. Good work guys.
A difference between kafka-python and pykafka is the producer interface. kafka-python does not require that you know the topic when instantiating the producer. This is convenient if you need to produce to topics dynamically based on input (which I do!) :)
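For anyone with the same need, one possible workaround with a topic-bound producer is to create producers lazily and cache one per topic. The sketch below is only an illustration: `TopicRouter`, `make_producer` and `FakeProducer` are hypothetical names, and the fake stands in for a real producer so the example runs without a broker. With PyKafka, `make_producer` might wrap something like `client.topics[name].get_producer()`, but that wiring is an assumption and is not shown.

```python
class TopicRouter:
    """Lazily create and cache one producer per topic name."""

    def __init__(self, make_producer):
        # make_producer is a hypothetical callable: topic name -> producer.
        # The returned object only needs a produce(message) method.
        self._make_producer = make_producer
        self._producers = {}

    def produce(self, topic_name, message):
        producer = self._producers.get(topic_name)
        if producer is None:
            # First message for this topic: build the producer once.
            producer = self._make_producer(topic_name)
            self._producers[topic_name] = producer
        producer.produce(message)


class FakeProducer:
    """Stand-in that records messages instead of talking to Kafka."""

    def __init__(self, topic_name):
        self.topic_name = topic_name
        self.sent = []

    def produce(self, message):
        self.sent.append(message)


if __name__ == "__main__":
    router = TopicRouter(FakeProducer)
    router.produce("clicks", b"first")
    router.produce("views", b"second")
    router.produce("clicks", b"third")  # reuses the cached "clicks" producer
```

The trade-off is that each distinct topic keeps its own producer alive, so this fits best when the set of topics is small and bounded.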
@ottomata That seems like an interesting request for us to look at. Want to open a separate issue about that?
Sure!
@emmett9001 @ottomata Just got pointed at this thread and thought I'd make a late contribution.
We compared pykafka and kafka-python about 2 months ago while trying to decide which one to use. In the end, the deciding factor for us was that balanced consumers were much easier to manage in pykafka.
Also, we discovered later, a pykafka producer doesn't die on Kafka broker restart, while our kafka-python producers did.
Below are performance figures from a 3-node Kafka cluster running in EC2, using a single producer or consumer. The three numbers for each test are the quartiles measured for the test.
- pykafka producer: 41400 – 46500 – 50200 messages per second
- pykafka consumer: 12100 – 14400 – 23700 messages per second
- kafka-python producer: 26500 – 27700 – 29500 messages per second
- kafka-python consumer: 35000 – 37300 – 39100 messages per second
So, for clarification, the median performance of a pykafka producer was 46500 messages per second, with a quartile range of 41400 (25th percentile) to 50200 (75th percentile). Hope that makes sense.
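For readers who want to report their own runs in the same three-number form, here is a minimal sketch using only the standard library (Python 3.8+ for `statistics.quantiles`). The sample values are made-up placeholders, not data from the benchmark above, `summarize_throughput` is a hypothetical helper, and the "inclusive" interpolation method is an assumption, since the thread does not say how the quartiles were computed.

```python
import statistics


def summarize_throughput(samples):
    """Return (25th percentile, median, 75th percentile) of the samples."""
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return q1, median, q3


if __name__ == "__main__":
    # Placeholder measurements in messages per second, one per test run.
    samples = [10_000, 20_000, 30_000, 40_000, 50_000]
    q1, median, q3 = summarize_throughput(samples)
    print(f"{q1:.0f} – {median:.0f} – {q3:.0f} messages per second")
```

Reporting quartiles rather than a single mean makes the spread visible, which matters for figures like the pykafka consumer's, where the 75th percentile is far above the median.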
This is awesome, thanks for the performance numbers @cscheffler. Do you have anything to share on the methodology you used to find them?
Cool! For the producer bench, did you just use the default parameters? I assume async with req_acks=1?
@cscheffler can you please share the links to the test scripts, if they are open-sourced? I see https://github.com/cscheffler/kafka-demo which uses pykafka. It would be a great help if you could share the test scripts for kafka-python that were used in your comparison. Thanks!
This writeup by @jofusa is the most thorough comparative benchmark of the python kafka clients I've seen.
Leaving a url of another benchmark done recently between pykafka 2.3.1, kafka-python 1.1.1, and confluent-kafka 0.9.1
http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/
Edit: already mentioned above by @emmett9001
original author here. just a fyi those are one and the same
It's Jul, 2017. Is there any update or a more recent comparison?
I think kafka-python now supports balanced consumers as well.
It's Jul, 2019! Any updates on the comparison? :)
It's April, 2020! Newbie here. What I want to find out is which one is friendlier for newcomers like us?
It's Sept, 2020! Any update?