seglo/kafka-lag-exporter

Metrics are not reset after consumer restart

aikoven opened this issue · 4 comments

In my app each consumer gets a unique generated client_id. When I restart all consumers, Kafka assigns partitions to their new client_ids, but the kafka_consumergroup_group_lag metrics for the old client_ids remain in the output until I restart kafka-lag-exporter.

The graph below shows stacked values of kafka_consumergroup_group_lag for each client_id on the first start, after restart of consumers, and then after restart of kafka-lag-exporter:

[Image: stacked graph of kafka_consumergroup_group_lag per client_id]

The Prometheus query is:

sum(kafka_consumergroup_group_lag) by (client_id)

seglo commented

Thanks for trying out the project @aikoven. The client_id is generated by Kafka clients (by default) and is used to distinguish one Kafka client from another. You probably want to aggregate on the group label, which is the same group.id you specify in your Kafka consumer properties configuration.

aikoven commented

In that case, I would get a single series for the whole group. But I'd like to monitor the lag for each consumer in the group separately.

The problem is that even if a consumer is no longer assigned a partition, its metrics still show up.

seglo commented

I see. Yes, this is also a problem for old consumer groups that no longer exist. When a metric is set for a particular set of labels, it remains on the Prometheus endpoint until it is explicitly unset. The fix would be for the exporter to either reset all metrics each reporting interval, or determine how to selectively unset metrics that are no longer valid according to information retrieved from the consumer group coordinator.
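
To illustrate the behaviour described above, here is a minimal Python sketch of a labelled gauge whose series persist until they are removed. It is a toy model, not the exporter's code or any Prometheus client library: the GaugeFamily class, its retain_only and clear methods, and the consumer-a/b/c client IDs are all hypothetical, and the label set is reduced to group and client_id for brevity.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

LabelSet = FrozenSet[Tuple[str, str]]


class GaugeFamily:
    """Hypothetical stand-in for a labelled gauge: one value per label set."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._series: Dict[LabelSet, float] = {}

    def set(self, labels: Dict[str, str], value: float) -> None:
        # Once set, a series stays in the map until something deletes it.
        self._series[frozenset(labels.items())] = value

    def clear(self) -> None:
        # Option 1 from the comment above: drop everything each interval.
        self._series.clear()

    def retain_only(self, live: Iterable[Dict[str, str]]) -> int:
        # Option 2: drop only the series whose labels are no longer valid.
        keep = {frozenset(labels.items()) for labels in live}
        stale = [key for key in self._series if key not in keep]
        for key in stale:
            del self._series[key]
        return len(stale)

    def client_ids(self) -> List[str]:
        return sorted(dict(key)["client_id"] for key in self._series)


lag = GaugeFamily("kafka_consumergroup_group_lag")

# First reporting interval: two consumers in the group.
lag.set({"group": "g", "client_id": "consumer-a"}, 10)
lag.set({"group": "g", "client_id": "consumer-b"}, 5)

# Consumers restart and come back under a new client_id.
current = [{"group": "g", "client_id": "consumer-c"}]
for labels in current:
    lag.set(labels, 3)

before = lag.client_ids()           # old client_ids are still exported
removed = lag.retain_only(current)  # unset series that are no longer live
after = lag.client_ids()
```

In this sketch, before still contains the two old client IDs alongside the new one, which matches the symptom in the graph; after retain_only only the live series is left. Clearing everything each interval is simpler, while the selective variant needs an authoritative list of live members, for example from the consumer group coordinator.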

seglo commented

@aikoven LMK if 0.4.1 addresses this problem for you!