linkerd/linkerd2

Kafka metrics support

grampelberg opened this issue · 26 comments

What problem are you trying to solve?

HTTP-based traffic is only one kind of traffic in modern applications. Many use message queues such as Kafka. Metrics for consumers, producers, and messages are just as critical to application health as requests and responses are in HTTP.

How should the problem be solved?

Implement a Kafka codec that allows the Linkerd proxy to introspect Kafka's protocol and provide metrics for the communication between consumers and producers. This should surface as a CLI command, a dashboard, and a visualization of the topology between message consumers and HTTP actors.

Related Issues

Hi,
I am interested in working on this issue and contributing to it as a GSoC project.
Although I am new to the Go and Rust programming languages, I have prior experience working with Kafka and a good understanding of its architecture and internals.

It would be great if I could get some help getting started.

@cAbhi15 you bet! The first step would be to start getting to know the codebase. This one is primarily on the Rust side of things, so you'll want to check out how the codecs work with tokio. You'll also want to dig into the metrics and how that all works! @hawkw, any pointers for good first steps?

Before you end up implementing anything or throwing up a PR, it'd be great to see a design doc. I can work with you on that and the structure of it.

hawkw commented

It's probably worth looking into previous attempts at writing a futures-based async Kafka codec in Rust; it looks like there's currently a prototype implementation but it doesn't seem complete or production-ready.

The codec implementation is probably going to comprise the majority of the work on this project --- integrating it with the proxy and adding metrics is fairly simple in comparison. To avoid having to do the whole thing from scratch, we might want to determine whether that prototype is something we can use, perhaps by talking with its maintainers about what's necessary before the library is ready for production use.

Familiarity with async programming in Rust is going to be vital, so learning Rust and getting to know the proxy codebase are probably good first steps if you aren't already familiar with them. Any proxy issues with the good first issue tag might be good starting points.

hawkw commented

@zaharidichev The linkerd2_proxy::proxy::http module mostly consists of HTTP-specific proxy behaviours; the actual protocol implementation and codecs are provided by libraries.

Linkerd 2 uses the hyper library (maintained by @seanmonstar), which implements clients and servers for HTTP/1 and HTTP/2. As @grampelberg mentioned, the HTTP/2 protocol implementation is h2 (maintained by @carllerche), which hyper uses to support HTTP/2.

@hawkw Thanks for the information. This is indeed quite an interesting issue.

Hi, I am also interested in this project for GSoC 2019. Can anyone share a project proposal template? Thank you

Same for me; I am interested in the project!

@Nethminiromina we don't have a specific template; a writeup of what you're planning to do, along with implementation details, would help a ton.

I was checking the code and I'd like to share some impressions to help. I understand that linkerd-proxy implements all the business logic, while linkerd-proxy-api exists to work with gRPC (tower-grpc) and protobuf/proto.
Furthermore, I found a useful example for tower-grpc that I refactored to set up a template: my example.

ruig2 commented

Hi, I'm a PhD student at the University of California, Irvine and I'm also interested in this project. I went through the code and documentation of Linkerd2 and Kafka briefly, and my idea for this project is as follows:

Target
Enable Linkerd2 to proxy messages that follow the Kafka protocol

Current condition

Plan to solve the issue
Since 1) Linkerd2 can proxy TCP and 2) Kafka runs over TCP, we only need to add a Kafka decoder/codec to decode the bytes so that we can analyze them and provide metrics accordingly. Our first choice would be to use a mature third-party decoder.

Tools/libraries that can help

Code-level plans

  • Checking the source code to learn how the HTTP proxy works will be helpful (linkerd2_proxy::proxy::http::h2.rs for HTTP/2). Basically, we want to know: 1) the code structure of Linkerd2 (e.g. how the linkerd2-proxy repo is used in the linkerd2 repo); 2) how to add a new protocol type; 3) the run-time pipeline Linkerd2 uses to recognize the protocol and inject a proxy function;
  • We can add a new kafka.rs file under linkerd2_proxy::proxy to implement the decoder logic;
  • When the decoder is ready, we can add it as a new protocol type and let Linkerd2 recognize it (maybe by adding a new entry at https://github.com/linkerd/linkerd2-proxy/blob/master/src/proxy/server.rs#L291 ).
  • Add CLI tool, web dashboard and metrics accordingly.

@grampelberg and @hawkw Please correct me if anything is wrong. Any ideas or comments?

Hi @ruig2! Thanks for getting involved.

For the sake of prototyping, it might be possible to pull in an existing Kafka client, but for anything that we'd want to accept into master, we'd want to use something that fits more tightly into our runtime.

Specifically, we need a Rust-based Kafka codec (i.e. no unsafe code) that exposes Kafka's messaging primitives over tokio's async i/o & execution model. (If you want to see an especially complex protocol implementation, check out h2).

I'm not too familiar with Kafka's protocol details, but hopefully we won't need to implement any of the "complicated" Kafka features in the proxy--those can be provided by the app's Kafka clients. We just need to be able to receive decoded Kafka messages via a tokio server and send them on via a tokio client.
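
To make that concrete, here's a minimal sketch (not Linkerd code) of the outermost layer of such a codec: Kafka's length-prefixed framing, written against the tokio-util codec traits and the bytes crate. The KafkaFrameCodec name is hypothetical, and a real implementation would go on to decode the framed payload into typed messages.

```rust
use bytes::{Buf, Bytes, BytesMut};
use tokio_util::codec::Decoder;

/// Splits a TCP stream into Kafka frames. Every Kafka request and
/// response is prefixed with a big-endian i32 payload length.
struct KafkaFrameCodec;

impl Decoder for KafkaFrameCodec {
    type Item = Bytes; // one complete (still undecoded) Kafka message
    type Error = std::io::Error;

    fn decode(&mut self, src: &mut BytesMut) -> Result<Option<Bytes>, Self::Error> {
        if src.len() < 4 {
            return Ok(None); // need more bytes for the length prefix
        }
        let len = i32::from_be_bytes([src[0], src[1], src[2], src[3]]) as usize;
        if src.len() < 4 + len {
            src.reserve(4 + len - src.len());
            return Ok(None); // the frame isn't fully buffered yet
        }
        src.advance(4); // consume the length prefix
        Ok(Some(src.split_to(len).freeze()))
    }
}
```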

I expect that the existing HTTP proxying (and metrics, routing, etc) logic will not be generally reusable for Kafka proxying... though we'll still want to fit into tower, which is (effectively) how we compose Linkerd's internals as a set of layers... See, for example, how we construct clients to the control plane.

Then, we can implement a metrics middleware that sits in a Kafka-stream-forwarding stack that looks something like the above.

Once we know what the codec library looks like, it should be a lot easier to describe how a metrics layer should be implemented.
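
As a very rough sketch of the shape such a layer could take, here's a tower Service wrapper that counts decoded requests before forwarding them to an inner service. KafkaRequest and its api_key field are hypothetical stand-ins for whatever the codec library ends up exposing, and a real middleware would use labeled Prometheus metrics rather than a bare counter.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll};
use tower::Service;

/// Hypothetical decoded request type from the codec library.
pub struct KafkaRequest {
    pub api_key: i16,
}

/// Middleware that records each Kafka request flowing through the stack.
struct KafkaMetrics<S> {
    inner: S,
    requests: Arc<AtomicU64>, // stand-in for a labeled prometheus counter
}

impl<S> Service<KafkaRequest> for KafkaMetrics<S>
where
    S: Service<KafkaRequest>,
{
    type Response = S::Response;
    type Error = S::Error;
    type Future = S::Future;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        self.inner.poll_ready(cx)
    }

    fn call(&mut self, req: KafkaRequest) -> Self::Future {
        // Record the request (in practice, labeled by req.api_key, topic, etc)
        // and pass it along unchanged.
        self.requests.fetch_add(1, Ordering::Relaxed);
        self.inner.call(req)
    }
}
```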

Hope this helps.

ruig2 commented

@olix0r Thanks for your detailed response. Now two things on top of my mind:

  1. Read up on the Kafka protocol, and evaluate/verify the Rust-based Kafka clients to see which one we can adopt;
  2. Check the Linkerd2 code to learn how to call the Kafka decoder in Linkerd2 and how to collect metrics when proxying.

Maybe my next step would be to build a prototype (as you mentioned earlier, maybe with a not-so-safe decoder) and document it in my proposal to increase my chances of being admitted into the GSoC program. Any ideas on how I can do better?

@ruig2 When you start going down the path of figuring out how to prototype with tokio, the tokio gitter is a great resource.

ruig2 commented

@olix0r @grampelberg @hawkw Good news: I've implemented a prototype that recognizes the Kafka request header via Tokio.

When you connect a Kafka client to the prototype, you can see something similar to the following:

Received a connection

bytes: [0, 0, 0, 20, 0, 18, 0, 2, 0, 0, 0, 6]
Request Size 20
Request api_key 18
Request api_version 2
Request correlation_id 0

where api_key 18 means ApiVersions Request (https://kafka.apache.org/protocol#The_Messages_ApiVersions).

The related Rust code of the prototype is at https://github.com/ruig2/kafka-codec/blob/master/tokio/examples/detect_kafka.rs#L41-L45 .

I'll polish and document how to run the prototype and the Kafka client later. Currently, the prototype does nothing but read the first 12 bytes of the stream and decode them according to the Kafka protocol. Lots of work remains (e.g. forwarding the TCP traffic to the Kafka server, handling the case where the traffic isn't Kafka, ...), but this prototype is a very good start for me.
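
For reference, here's a self-contained sketch of the decode step such a prototype performs, using only the standard library. Per the Kafka protocol, a request begins with a big-endian i32 size followed by a header of i16 api_key, i16 api_version, and i32 correlation_id.

```rust
/// Decodes the fixed-size prefix of a Kafka request: the i32 message size
/// plus the first header fields, all big-endian per the protocol spec.
fn parse_request_prefix(buf: &[u8; 12]) -> (i32, i16, i16, i32) {
    let size = i32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]);
    let api_key = i16::from_be_bytes([buf[4], buf[5]]);
    let api_version = i16::from_be_bytes([buf[6], buf[7]]);
    let correlation_id = i32::from_be_bytes([buf[8], buf[9], buf[10], buf[11]]);
    (size, api_key, api_version, correlation_id)
}

fn main() {
    // The first 12 bytes captured in the log above.
    let bytes = [0u8, 0, 0, 20, 0, 18, 0, 2, 0, 0, 0, 6];
    let (size, api_key, api_version, correlation_id) = parse_request_prefix(&bytes);
    println!("size={} api_key={} api_version={} correlation_id={}",
             size, api_key, api_version, correlation_id);
}
```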

ruig2 commented

A draft of my proposal is created at https://docs.google.com/document/d/1cQeQ5GMyTFAuytygGE281OJJjsLunUImZNm_uppKg-Q/edit . Looking forward to your comments, and feel free to tell me if more information is needed.

Is there any update or movement on this issue? We are migrating workloads involving meshed Kafka clients and have just started investigating possibilities.

ruig2 commented

@halcyondude The report of the GSoC project is at https://github.com/ruig2/KafkaRustCodec/blob/master/gsoc.md . I did some investigation into meshed Kafka clients this summer; however, it is not yet ready for a production environment.

Please let me know if I can help ;)

loxp commented

Hi, I am interested in this issue for Community Bridge 2020. Are there any preliminary tasks for it?

@loxp you'll want to jump into Slack and get an RFC together. We'd love to get to know you a little better too, so if you could work through a getting-started issue, that'd be awesome!

Hi, are there any updates on this topic?

ruig2 commented

@JTarball Hi, this project was part of the Google Summer of Code 2019. The related codebase is at https://github.com/ruig2/KafkaRustCodec and I hope it's helpful to you.

rewt commented

> @JTarball Hi, this project was part of the Google Summer of Code 2019. The related codebase is at https://github.com/ruig2/KafkaRustCodec and I hope it's helpful to you.

Do you have any docs on how to implement?

Hi all. I've started maintaining a full implementation of the Kafka protocol and am going to try to tackle this as a side project to level up my Rust.

I've just started exploring the proxy codebase, so don't have a ton of thoughts right now, but here are some outstanding questions I'm pondering:

  • Whether Kafka metrics can just be wrapped around existing TCP services. To get useful metrics, we need to frame the stream so that we can partially deserialize messages.
  • Interaction with load balancing. Kafka wants to talk to a coordinator and then talk to brokers directly based on partition assignments. I have questions about how this will interact with proxy functions.
  • Outbound vs inbound. Running Kafka meshed inside k8s is a super interesting use-case, but it may be desirable (or even necessary?) to track all metrics from the outbound client side only. There are some control messages passed between broker nodes that could be interesting, but I'm planning to start with just client metrics first.

I will continue to post here as I dig and discover more. This is mostly for my own personal edification rather than needing this to be accepted upstream, but I will hopefully open some WIP pull requests to get feedback if people have time to review / evaluate the viability of making this "a thing." Way too early to say now.

@tychedelia If I were to tackle this, I'd probably take the following approach:

  1. Ignoring the implementation, what are the metrics/labels that would be worth tracking? I'd probably start by mocking out the Prometheus metrics as they would be exported once this is all done (see the sketch after this list).
  2. How does a proxy know whether it's transporting Kafka vs another protocol? We'll need to update the Server CRD to include a proxyProtocol variant for Kafka. And then we'll need to update the proxy API to include these hints for both inbound policy as well as the destination service's GetProfile protocol hinting.
  3. Finally, we'd update the inbound and outbound proxies to take advantage of this hinting. I'd probably start on the inbound (server) side.
  4. I don't think load balancing will apply, as I assume Kafka clients will never initiate connections to a service IP. Instead, I'd expect the client to only ever initiate connections to individual pods.
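
As a starting point for item 1, here's a hedged sketch of the kind of metric families a Kafka-aware proxy might export, using the prometheus crate. All metric names and label sets here are hypothetical, not an agreed design.

```rust
use prometheus::{
    register_histogram_vec, register_int_counter_vec, HistogramVec, IntCounterVec,
};

/// Registers a hypothetical set of Kafka proxy metrics with the default registry.
fn kafka_metrics() -> prometheus::Result<(IntCounterVec, IntCounterVec, HistogramVec)> {
    // Requests proxied, labeled by API key (Produce, Fetch, ...), topic, and direction.
    let requests = register_int_counter_vec!(
        "kafka_requests_total",
        "Total Kafka requests proxied",
        &["api_key", "topic", "direction"]
    )?;
    // Error responses, labeled additionally by Kafka error code.
    let errors = register_int_counter_vec!(
        "kafka_response_errors_total",
        "Total Kafka error responses proxied",
        &["api_key", "topic", "error_code"]
    )?;
    // Request/response latency as observed by the proxy.
    let latency = register_histogram_vec!(
        "kafka_request_duration_seconds",
        "Latency of proxied Kafka requests",
        &["api_key", "topic"]
    )?;
    Ok((requests, errors, latency))
}
```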