Export partition status
ercliou-zz opened this issue · 6 comments
I'd like to export the status of each partition too.
We can always write some logic on the Prometheus end, but Burrow already does this well.
https://github.com/linkedin/Burrow/wiki/http-request-consumer-group-status
These are the valid status strings: NOTFOUND, OK, WARN, ERR, STOP, STALL
Edit:
We shall model them as separate time series
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="OK"} 1
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="STOP"} 1
something like https://www.robustperception.io/exposing-the-software-version-to-prometheus/
@ercliou
Hello.
I also want these metrics.
You seem to have made some changes after forking, but are you planning to send a patch upstream?
I ended up implementing this by sending a series for every state on each scrape. When a state is not the partition's current one, its value is 0. This multiplies the number of time series by five per partition (could be a problem if you have a lot of them).
e.g.
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="OK"} 1
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="STOP"} 0
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="REWIND"} 0
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="STALL"} 0
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="WARN"} 0
This is so that each of them stays an independent time series. The reason is that I can then query lag and status together in Grafana, per partition.
Query:
kafka_burrow_partition_lag{group="MY_GROUP",topic="MY_TOPIC"}
* on (topic, partition, group) group_left(state)
(kafka_burrow_partition_state{group="MY_GROUP",topic="MY_TOPIC"} == 1)
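For illustration, the one-series-per-state encoding described above could be sketched as follows. This is a minimal sketch, not the exporter's actual code: the function name `partition_state_lines` is hypothetical, and the state list is only the five states shown in the example series, so the full set Burrow reports may differ.

```python
# States taken from the example series above (an assumption; Burrow's
# full set of status strings may contain more entries).
STATES = ["OK", "STOP", "REWIND", "STALL", "WARN"]


def partition_state_lines(cluster, group, topic, partition, current_state):
    """Render one exposition line per state: 1 for the current state, 0 otherwise."""
    lines = []
    for state in STATES:
        labels = (
            f'cluster="{cluster}",group="{group}",'
            f'partition="{partition}",topic="{topic}",state="{state}"'
        )
        value = 1 if state == current_state else 0
        lines.append(f"kafka_burrow_partition_state{{{labels}}} {value}")
    return lines


if __name__ == "__main__":
    for line in partition_state_lines("MY_CLUSTER", "MY_GROUP", "MY_TOPIC", 13, "OK"):
        print(line)
```

Because every state always has a series, the `== 1` filter in the query selects exactly one series per partition, which is what makes the join against the lag metric one-to-one.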
I could send a patch if @jirwin agrees with this :)
I'm +1 to this. Partition count isn't generally unbounded. Maybe it could be enabled by a command-line flag, so people can use their own judgement as to whether the surge in new time series is acceptable to them. Maybe --per-partition-stats or something?
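A rough sketch of how such an opt-in flag might behave, written in Python purely for illustration. The flag name comes from the suggestion above; `build_parser` and `series_per_partition` are hypothetical helpers, and the default of five states per partition is an assumption based on the example series earlier in the thread.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical exporter command line."""
    parser = argparse.ArgumentParser(description="Hypothetical exporter options")
    # Off by default, so the extra per-partition state series are opt-in.
    parser.add_argument(
        "--per-partition-stats",
        action="store_true",
        help="also export one state series per partition and state",
    )
    return parser


def series_per_partition(args, state_count=5):
    """Number of state series emitted per partition under this sketch."""
    return state_count if args.per_partition_stats else 0


if __name__ == "__main__":
    # An explicit argument list keeps the demo independent of sys.argv.
    args = build_parser().parse_args(["--per-partition-stats"])
    print(series_per_partition(args))
```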
How about we define a numeric scheme for the value of this time series? This would save us from the fivefold time series bloat. Our system has 2525 partitions across 52 topics. I am definitely worried about the bloat.
NOTFOUND = 1
OK = 2
WARN = 3
ERR = 4
STOP = 5
STALL = 6
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC"} 2
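The numeric scheme above could be sketched like this. It is a minimal illustration under stated assumptions: the codes are exactly those proposed in this comment, and `partition_state_line` is a hypothetical helper, not part of the exporter.

```python
# Numeric codes exactly as proposed in the comment above.
STATE_CODES = {"NOTFOUND": 1, "OK": 2, "WARN": 3, "ERR": 4, "STOP": 5, "STALL": 6}


def partition_state_line(cluster, group, topic, partition, state):
    """Render a single exposition line whose value encodes the state."""
    labels = (
        f'cluster="{cluster}",group="{group}",'
        f'partition="{partition}",topic="{topic}"'
    )
    return f"kafka_burrow_partition_state{{{labels}}} {STATE_CODES[state]}"


if __name__ == "__main__":
    print(partition_state_line("MY_CLUSTER", "MY_GROUP", "MY_TOPIC", 13, "OK"))
```

With this encoding, 2525 partitions yield 2525 state series for one consumer group, versus 12625 with five series per partition. The trade-off is that the state is no longer available as a label, so it cannot be attached to another metric with `group_left`.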
Hi @shibug, I explained a little about the reasoning behind this in the above PR (centered mostly around Grafana).
We have 15k partitions and haven't encountered performance problems (yet).
I can't look into the command-line flag right now; if someone would like to look into it, I'd appreciate it.