redpanda-data/kminion

Any way of getting estimated consumer lag in seconds in PromQL?

sebw91 opened this issue · 6 comments

KMinion works great, thank you.

Does anyone have a way of computing an estimated consumer time lag in PromQL?

I think we'd have to somehow join two series, kminion_kafka_consumer_group_topic_offset_sum and kminion_kafka_topic_high_water_mark_sum.

Conceptually the query should be something along the lines of:
time() - time_at_value(kminion_kafka_topic_high_water_mark_sum, kminion_kafka_consumer_group_topic_offset_sum + 1)

Where time_at_value is a hypothetical function that returns the timestamp at which a series reached a given value. Not something that exists in Prometheus.

weeco commented

Hey @sebw91,
yes, an approximate time lag is possible and I fully support that. Lag should really be exported as time, because that is what users actually want to know.

I have thought about how to solve this in the past and had a few different ideas. There's one exporter that uses interpolation; see https://github.com/seglo/kafka-lag-exporter for more information. It's a bigger effort to implement this, and I currently don't plan to spend that amount of time on KMinion. If you are interested in trying this, I'd suggest coming up with a proposal that we can discuss here before starting the implementation. It's not trivial to implement in a way that scales to larger clusters, though, and that would be a requirement for KMinion.
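For reference, the interpolation idea works roughly like this (my summary, not necessarily how that exporter implements it): take two scraped samples (t1, o1) and (t2, o2) of a partition's high water mark that bracket the group's committed offset c, linearly interpolate the time at which offset c was produced, and subtract that from the current time:

t_est(c) = t1 + ((c - o1) / (o2 - o1)) * (t2 - t1)
lag_seconds ≈ now - t_est(c)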

Thanks a lot for the info. I was hoping it would be possible to do something in PromQL. From what I can see, all the data needed to compute a very rough estimate is already there. I would be fine without interpolation; just using the high water mark history as a lower bound on when the consumer's current offset was produced would be sufficient. This may not be possible, though.

weeco commented

Oh, I see what you mean. You are saying the information about when a certain high water mark in a partition was reached is already stored in Prometheus (at least up to the retention period), so the interpolation logic could somehow be put into the PromQL itself.

That's indeed a good idea! I'm not sure whether it's possible with the available PromQL functions, but it's definitely worth a try!

I think there is a way (kinda), using the PromQL offset modifier! If the high water mark of a topic 5 minutes ago is greater than the current offset_sum of the consumer group, then we know we are at least 5 minutes behind. For example:

kminion_kafka_topic_high_water_mark_sum offset 5m
  > on (topic_name) kminion_kafka_consumer_group_topic_offset_sum + 1

I will continue exploring on my side, but this should do the trick for us.
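Building on that trick, several offsets can be chained with or into a stepped lower-bound estimate (here in minutes). This is an untested sketch that assumes the metric and label names used in this thread; group_right is added because several consumer groups may match a single topic series, and the * 0 + N trick turns each matched series into a constant bucket value:

(
  (kminion_kafka_topic_high_water_mark_sum offset 15m
     > on (topic_name) group_right kminion_kafka_consumer_group_topic_offset_sum + 1) * 0 + 15
)
or
(
  (kminion_kafka_topic_high_water_mark_sum offset 5m
     > on (topic_name) group_right kminion_kafka_consumer_group_topic_offset_sum + 1) * 0 + 5
)
or
(kminion_kafka_consumer_group_topic_offset_sum * 0)

The result is 15 for groups at least 15 minutes behind, 5 for groups at least 5 minutes behind, and 0 otherwise; finer buckets follow the same pattern, up to the Prometheus retention window.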

This is an interesting request/subject. As mentioned already, Kafka itself has no notion of consumer lag in time units (seconds), probably because it depends on how fast a consumer can consume (and is actually consuming) a given partition, or more generally on the consumer's current and expected consumption throughput.

For this reason, we approximate the consumer lag (all-partitions mode) in seconds by dividing the message lag by the rate of the topic's high water mark, i.e., using the production rate as a proxy for the consumption rate, like this:

sum(kminion_kafka_consumer_group_topic_lag{job=~"$job",group_id=~"$group_id"})
  by (group_id, topic_name)
/ on (topic_name) group_left
sum(rate(kminion_kafka_topic_high_water_mark_sum{job=~"$job"}[$__rate_interval]))
  by (group_id, topic_name)

This is used in Grafana, hence the $__rate_interval variable, which can be replaced by a static rate() range.
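Outside Grafana, for example in a recording or alerting rule, the same idea might look like the following untested sketch, with the Grafana variables dropped, $__rate_interval replaced by a static 5m range, and the right-hand by clause reduced to topic_name (the high water mark metric carries no group_id label anyway):

sum(kminion_kafka_consumer_group_topic_lag) by (group_id, topic_name)
/ on (topic_name) group_left
sum(rate(kminion_kafka_topic_high_water_mark_sum[5m])) by (topic_name)

The result is in seconds (messages divided by messages per second), so an alert could fire when, say, the value stays above 300 for ten minutes.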

Hope this is useful somehow :)

@hhromic That's a clever prom query - very useful. Thanks very much. I think this is accurate enough for my use case.