Export partition status

Question

ercliou-zz opened this issue 7 years ago · 6 comments

I'd like to export the status of each partition as well.
We could always write some logic on the Prometheus end, but Burrow already does this well.
https://github.com/linkedin/Burrow/wiki/http-request-consumer-group-status
These are the valid status strings: NOTFOUND, OK, WARN, ERR, STOP, STALL

Edit:
We should model them as separate time series:

kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="OK"} 1
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="STOP"} 1

something like https://www.robustperception.io/exposing-the-software-version-to-prometheus/

Answer 1 · 2018-03-14T09:57:03.000Z

@ercliou
Hello.
I also want this metric.
You seem to have made some changes after forking, but are you planning to send a patch upstream?

Answer 2 · 2018-03-14T19:30:39.000Z

I ended up implementing this by sending a series for every state at every scrape. When a state is not the current one, its value is 0. This multiplies the number of time series per partition by 5 (which could be a problem if you have a lot of partitions).
e.g.

kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="OK"} 1
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="STOP"} 0
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="REWIND"} 0
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="STALL"} 0
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="WARN"} 0

This way each of them stays an independent time series. The reason for this is so that I can query lag and status together in Grafana, by partition.
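As a rough illustration of the one-series-per-state idea, here is a minimal Python sketch that builds the exposition lines for one partition. It is not the exporter's actual code: the function name `partition_state_lines` and the `STATES` tuple are hypothetical, and the state list is simply copied from the example above.

```python
# Hypothetical helper, not taken from the exporter itself.
# The state names are the ones shown in the example above.
STATES = ("OK", "STOP", "REWIND", "STALL", "WARN")


def partition_state_lines(cluster, group, topic, partition, current_state, states=STATES):
    """Return one exposition line per known state: 1 for the current state, 0 otherwise."""
    lines = []
    for state in states:
        value = 1 if state == current_state else 0
        lines.append(
            f'kafka_burrow_partition_state{{cluster="{cluster}",group="{group}",'
            f'partition="{partition}",topic="{topic}",state="{state}"}} {value}'
        )
    return lines


if __name__ == "__main__":
    for line in partition_state_lines("MY_CLUSTER", "MY_GROUP", "MY_TOPIC", 13, "OK"):
        print(line)
```

Because every state is always present with a 0 or 1 value, each series keeps a stable identity between scrapes, at the cost of one series per state per partition.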
Query:

kafka_burrow_partition_lag{group="MY_GROUP",topic="MY_TOPIC"}
* on (topic, partition, group) group_left(status) 
(kafka_burrow_partition_status{group="MY_GROUP",topic="MY_TOPIC"} == 1)

I could send a patch if @jirwin agrees with this :)

Answer 3 · 2018-03-17T00:30:58.000Z

I'm +1 to this. Partition count isn't generally unbounded. Maybe it could be enabled by a command-line flag, so people can use their own judgement as to whether the surge in new time series is acceptable to them. Maybe --per-partition-stats or something?
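To make the opt-in idea concrete, here is a small Python sketch of such a switch using `argparse`. It only illustrates the shape of the option: the flag name is the one floated above, while `build_parser` and the help text are hypothetical and do not describe the exporter's real command line.

```python
import argparse


def build_parser():
    """Build a hypothetical CLI with an opt-in switch for per-partition series."""
    parser = argparse.ArgumentParser(description="Hypothetical exporter CLI")
    # Flag name taken from the suggestion above; it defaults to off,
    # so the extra time series are only produced when requested.
    parser.add_argument(
        "--per-partition-stats",
        action="store_true",
        help="also export one state series per partition",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--per-partition-stats"])
    print(args.per_partition_stats)
```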

Answer 4 · 2018-04-20T03:19:37.000Z

How about we define a numeric scheme for the value of this time series? This would save us from the fivefold time series bloat. Our system has 2525 partitions across 52 topics, so I am definitely worried about the bloat.

NOTFOUND = 1
OK = 2
WARN = 3
ERR = 4
STOP = 5
STALL = 6

kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC"} 2
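A minimal Python sketch of this numeric scheme, assuming the codes listed above; the names `STATUS_CODES` and `partition_state_line` are hypothetical. It emits a single series per partition whose value encodes the status.

```python
# Numeric codes as proposed above; the names used here are hypothetical.
STATUS_CODES = {"NOTFOUND": 1, "OK": 2, "WARN": 3, "ERR": 4, "STOP": 5, "STALL": 6}


def partition_state_line(cluster, group, topic, partition, status):
    """Return a single exposition line whose value is the numeric status code."""
    code = STATUS_CODES[status]  # raises KeyError for an unknown status
    return (
        f'kafka_burrow_partition_state{{cluster="{cluster}",group="{group}",'
        f'partition="{partition}",topic="{topic}"}} {code}'
    )


if __name__ == "__main__":
    print(partition_state_line("MY_CLUSTER", "MY_GROUP", "MY_TOPIC", 13, "OK"))
```

The trade-off is that the status name is no longer available as a label, so dashboards would have to translate the numeric value back into a name themselves.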

Answer 5 · 2018-04-21T14:17:59.000Z

Hi @shibug, I explained a bit of the reasoning behind this in the PR above (it centers mostly on Grafana).

We have 15k partitions and haven't encountered performance problems (yet).
I can't look into the command-line flag right now. If someone would like to look into it, I'd appreciate it.

Answer 6 · 2018-07-09T22:34:48.000Z

Fixed by #19.