tychedelia/kafka-protocol-rs

Zero copy message decoding

thedodd opened this issue · 1 comments

This would add a next level of performance to parsing incoming Kafka requests. The main idea:

  • Request payloads would be parsed / validated in much the same way as is currently done in this crate as of 0.8.x.
  • Instead of allocating new collections to produce owned copies of the decoded messages, we would produce messages which borrow data from the backing Bytes buffer.
  • This pattern of always expecting a backing Bytes buffer will be quite nice, because then the type signatures for the zero-copy types will not need to be generic over lifetimes, instead they will simply embed the Bytes buffer.
  • There are a few more difficult patterns which we will have to tackle (indexmaps, vectors, things of that nature); however, a lot of the work could likely be amortized:
    • The zero-copy message types could embed state where needed. Offsets into the buffer. Version info. Things of that nature.
    • Amortizing lookups and offsets will be much less expensive than copying data and allocating storage.
  • BONUS: support direct mutation of data without having to copy. This would be particularly helpful in cases where record offsets need to be updated, and things of that nature.
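
To make the "embed state where needed" idea concrete, here is a minimal sketch of what a zero-copy view over an array of strings might look like. Everything named here (`StringArrayView`, `DecodeError`, `read`) is hypothetical and not part of this crate. It assumes the non-flexible Kafka encoding (INT32 element count, then an INT16 length plus UTF-8 bytes per string), and it uses `Arc<[u8]>` as a stand-in for `Bytes` so the sketch builds with only the standard library.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    Truncated,
    NegativeLength,
    InvalidUtf8,
}

/// Hypothetical zero-copy view over a non-flexible ARRAY of STRING.
/// No lifetime parameter: the view owns a cheap handle to the buffer.
pub struct StringArrayView {
    buf: Arc<[u8]>,           // stand-in for `bytes::Bytes` (assumption)
    spans: Vec<Range<usize>>, // byte range of each string, computed once
}

impl StringArrayView {
    /// Validate the payload once and record where each string lives.
    /// No string data is copied; only offsets are stored.
    pub fn parse(buf: Arc<[u8]>) -> Result<Self, DecodeError> {
        let count = i32::from_be_bytes(read::<4>(&buf, 0)?);
        if count < 0 {
            return Err(DecodeError::NegativeLength);
        }
        let mut pos = 4;
        // Grow on demand rather than trusting `count` for the capacity.
        let mut spans = Vec::new();
        for _ in 0..count {
            let len = i16::from_be_bytes(read::<2>(&buf, pos)?);
            if len < 0 {
                return Err(DecodeError::NegativeLength);
            }
            let start = pos + 2;
            let end = start + len as usize;
            let bytes = buf.get(start..end).ok_or(DecodeError::Truncated)?;
            std::str::from_utf8(bytes).map_err(|_| DecodeError::InvalidUtf8)?;
            spans.push(start..end);
            pos = end;
        }
        Ok(Self { buf, spans })
    }

    /// Number of strings in the array.
    pub fn len(&self) -> usize {
        self.spans.len()
    }

    /// Borrow the i-th string straight out of the backing buffer.
    /// UTF-8 is re-checked here to stay in safe Rust; a real
    /// implementation might rely on the check already done in `parse`.
    pub fn get(&self, i: usize) -> Option<&str> {
        let span = self.spans.get(i)?.clone();
        std::str::from_utf8(&self.buf[span]).ok()
    }
}

/// Copy N bytes out of `buf` at `pos`, failing if the buffer is too short.
fn read<const N: usize>(buf: &[u8], pos: usize) -> Result<[u8; N], DecodeError> {
    let end = pos.checked_add(N).ok_or(DecodeError::Truncated)?;
    let src = buf.get(pos..end).ok_or(DecodeError::Truncated)?;
    let mut out = [0u8; N];
    out.copy_from_slice(src);
    Ok(out)
}
```

The only allocation is the `spans` vector of offsets, which is the kind of amortized bookkeeping described above: one pass to validate and index, then cheap lookups that hand back borrowed data.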

Other projects which have explored this space:

One thing that could help bypass a lot of the difficulty with alignment and the like: just use accessors to access data. Don't attempt to build structs which are backed by the buffer. Instead, access fields of data via methods on a struct which simply embeds the Bytes buffer. Definitely still edge cases and things to work through; however, that alone will bypass a large portion of alignment issues.
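
As a rough sketch of the accessor idea, assuming the fixed-size prefix of a Kafka request header (INT16 API key, INT16 API version, INT32 correlation ID, all big-endian): the type below never reinterprets buffer memory as a struct, so alignment does not come into play. `RequestHeaderView` is a hypothetical name, not an existing type in this crate, and `Arc<[u8]>` again stands in for `Bytes` to keep the example dependency-free.

```rust
use std::sync::Arc;

/// Hypothetical accessor-style view over the fixed prefix of a request header.
pub struct RequestHeaderView {
    buf: Arc<[u8]>, // stand-in for `bytes::Bytes` (assumption)
}

impl RequestHeaderView {
    /// api_key (2) + api_version (2) + correlation_id (4)
    const FIXED_LEN: usize = 8;

    /// Check the length once so the accessors below cannot go out of bounds.
    pub fn new(buf: Arc<[u8]>) -> Option<Self> {
        if buf.len() < Self::FIXED_LEN {
            return None;
        }
        Some(Self { buf })
    }

    /// Each accessor decodes its field on demand from individual bytes,
    /// so no aligned load from the buffer is ever required.
    pub fn api_key(&self) -> i16 {
        i16::from_be_bytes([self.buf[0], self.buf[1]])
    }

    pub fn api_version(&self) -> i16 {
        i16::from_be_bytes([self.buf[2], self.buf[3]])
    }

    pub fn correlation_id(&self) -> i32 {
        i32::from_be_bytes([self.buf[4], self.buf[5], self.buf[6], self.buf[7]])
    }
}
```

The trade-off is that every field read decodes again, but for fixed-width integers that cost is a handful of byte loads, and nothing is allocated or copied up front.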

Thoughts?

At one point, I was working on a proxy that could benefit from this approach, but ultimately decided I just needed to parse the header, which is pretty straightforward. I'd be curious to know how much overhead we currently have parsing and constructing messages, what the use cases of our users are, and whether they'd benefit from such an API. I think I'd want such a project to be driven by a production user to make sure any improvements were worth the complexity cost.