Error on message lag on consumer offsets
HighWatersDev opened this issue · 7 comments
Hi,
I'm running kminion v2.2.0 and it started off just fine. However, after some time, I'm getting these errors:
{"level":"info","ts":"2022-09-19T16:08:11.957Z","logger":"main.storage","msg":"Tried to fetch consumer group offsets, but haven't consumed the whole topic yet"}
{"level":"info","ts":"2022-09-19T16:08:12.031Z","logger":"main.minion_service","msg":"catching up the message lag on consumer offsets","lagging_partitions_count":1,"lagging_partitions":[{"Name":"__consumer_offsets","Id":6,"Lag":328}],"total_lag":328}
values.yaml

deployment:
  volumes:
    secrets:
      - secretName: kafka-tls
        mountPath: /secret/tls

kminion:
  config:
    kafka:
      brokers:
        - kafka-cluster-sc-kafka-0.kafka-cluster-sc-kafka-brokers.kafka.svc.cluster.local:9094
        - kafka-cluster-sc-kafka-1.kafka-cluster-sc-kafka-brokers.kafka.svc.cluster.local:9094
        - kafka-cluster-sc-kafka-2.kafka-cluster-sc-kafka-brokers.kafka.svc.cluster.local:9094
      clientId: "kminion"
      tls:
        enabled: true
        caFilepath: "/secret/tls/ca.crt"
        certFilepath: "/secret/tls/tls.crt"
        keyFilepath: "/secret/tls/tls.key"
    minion:
      consumerGroups:
        enabled: true
        scrapeMode: offsetsTopic # Valid values: adminApi, offsetsTopic
        granularity: partition
        allowedGroups: [ ".*" ]
        ignoredGroups: [ ]
      topics:
        granularity: partition
        allowedTopics: [ ".*" ]
        ignoredTopics: [ ]
        infoMetric:
          configKeys: [ "cleanup.policy" ]
      logDirs:
        enabled: true
      endToEnd:
        enabled: true
        probeInterval: 100ms
        topicManagement:
          enabled: true
          name: kminion-end-to-end
          reconciliationInterval: 10m
          replicationFactor: 1
          partitionsPerBroker: 1
        producer:
          ackSla: 5s
          requiredAcks: all
        consumer:
          groupIdPrefix: kminion-end-to-end
          deleteStaleConsumerGroups: false
          roundtripSla: 20s
          commitSla: 10s

serviceMonitor:
  create: true
  additionalLabels:
    release: prom-stack
Any progress on this issue? I'm having a similar problem.
Hello, this is an informational log message as far as I can tell. Is there any impact due to this?
This message indicates that it's not able to consume this specific partition. I'm not sure what the reason for this may be in your cluster, but you could also change the scrape mode to use the Kafka API rather than consuming the consumer offsets topic.
Hi weeco,
My issue indeed occurs when I'm using offsetsTopic as the scrape mode.
My application scrapes many partitions of the __consumer_offsets topic and then gets stuck at 2 partitions, as shown below:
{"level":"info","ts":"2023-06-13T13:18:54.490Z","logger":"main.minion_service","msg":"catching up the message lag on consumer offsets","lagging_partitions_count":2,"lagging_partitions":[{"Name":"__consumer_offsets","Id":35,"Lag":11450323},{"Name":"__consumer_offsets","Id":3,"Lag":11790495}],"total_lag":23240818}
Although the log is informational, it means the pod never enters the ready state: /ready returns a 503, so the pod cannot be reached from outside.
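For anyone hitting this, the readiness symptom can be confirmed with a quick probe from inside the cluster. This is only a sketch: the service DNS name, namespace, and port 8080 are assumptions about a typical KMinion deployment, so adjust them to your release.

```shell
# Hypothetical service name/namespace/port; adjust to your deployment.
URL="http://kminion.monitoring.svc.cluster.local:8080/ready"

# While KMinion is still catching up on the offsets topic, /ready answers 503,
# which keeps the pod out of the Service endpoints.
CODE=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$URL" || true)
echo "readiness status: ${CODE:-unreachable}"
```

A `200` here means the pod is ready; a `503` (or no answer) matches the stuck state described above.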
Changing the scrape mode is a valid solution for me. Are you aware of downsides to using the adminAPI, other than missing kminion_kafka_consumer_group_offset_commits_total metric?
I ran into this, and I was able to recover by changing the leader of the partition that was reporting the lag. I'm not sure yet if there is a problem with that particular broker, as this was the only problem I was having in a fairly active cluster. To be fair though, this also means I lost my dashboards/alerting for a bit, so it's possible there were some other issues I just didn't catch.
I had first tried just restarting the leader broker, but that didn't seem to help at all.
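For reference, the leader change described above can be done without moving replicas, using Kafka's preferred-leader election tool (ships with Kafka 2.4+). The broker address and partition ids below are taken from the logs earlier in this thread and are assumptions; adjust them to your cluster before running.

```shell
# Sketch only: partition ids 35 and 3 come from the log lines above;
# point --bootstrap-server at your own brokers.

# Election file targeting only the lagging __consumer_offsets partitions.
cat > /tmp/election.json <<'EOF'
{"partitions": [
  {"topic": "__consumer_offsets", "partition": 35},
  {"topic": "__consumer_offsets", "partition": 3}
]}
EOF

# Trigger a preferred-replica election for just those partitions.
if command -v kafka-leader-election.sh >/dev/null 2>&1; then
  kafka-leader-election.sh \
    --bootstrap-server kafka-cluster-sc-kafka-brokers.kafka.svc.cluster.local:9094 \
    --election-type PREFERRED \
    --path-to-json-file /tmp/election.json
else
  echo "kafka-leader-election.sh not on PATH; run this from a Kafka broker pod"
fi
```

Note this only elects the preferred replica as leader; if the preferred replica is already the leader, you would need a partition reassignment to move leadership to a different broker.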
Changing the scrape mode is a valid solution for me. Are you aware of downsides to using the adminAPI, other than missing kminion_kafka_consumer_group_offset_commits_total metric?
No side effects besides slightly less accessible information (the number of offset commits); in fact, most Kafka exporters just use the Kafka API because it's much easier to implement.
Thanks for sharing the info @michaeljwood and @weeco. I might try restarting my brokers one by one to trigger the assignment of a new leader, or find another way to assign one.
@reidmeyer It seems very rare and specific to very few Kafka environments. It's unclear why this happens, and I have very little information on what I could possibly look at. The code looks fine and works well in many other clusters. My recommendation is to use the default scrapeMode (adminApi) instead if the offsetsTopic scrape mode is causing issues.
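Concretely, the suggested switch is a one-line change against the values.yaml shown earlier in this thread (a sketch; key names follow the config posted above):

```yaml
kminion:
  config:
    minion:
      consumerGroups:
        enabled: true
        scrapeMode: adminApi # default; avoids consuming __consumer_offsets
```

With adminApi, KMinion fetches group offsets via the Kafka admin API instead of consuming __consumer_offsets, at the cost of the kminion_kafka_consumer_group_offset_commits_total metric mentioned above.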