Freshness tracker should fail a cluster iteration if all partitions for all consumers fail
jyates commented
Currently, we are very generous with the failure constraints for a cluster, from ConsumerFreshness (ln 281-293):
// if all the consumer measurements succeed, then we return the cluster name
// otherwise, Future.get will throw an exception representing the failure to measure a consumer (and thus the
// failure to successfully monitor the cluster).
return Futures.whenAllSucceed(completedConsumers).call(client::getCluster, this.executor);
}
/**
* Measure the freshness for all the topic/partitions currently consumed by the given consumer group. To maintain
* the existing contract, a consumer measurement fails ({@link Future#get()} throws an exception) only if:
* - burrow group status lookup fails
* - execution is interrupted
* Failure to actually measure the consumer is swallowed into a log message & metric update; obviously, this is less
* than ideal for many cases, but it will be addressed later.
However, SSL connection issues (e.g. a misconfiguration) only show up when querying the consumers. So you can have a valid burrow lookup for the cluster (because burrow is configured correctly) while freshness fails for every consumer because the tracker is misconfigured. You would never know, though, from the kafka_consumer_freshness_last_success_run_timestamp metric, since that metric is not affected by these per-consumer failures.
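One possible shape for the stricter check is sketched below. It is not the tracker's actual code: ConsumerResult, measureCluster, and the success/failure counts are hypothetical names introduced for illustration, and it uses the JDK's CompletableFuture rather than the Guava Futures call shown above so that the example is self-contained. The idea is that each consumer measurement reports how many partitions it measured and how many it failed, and the cluster iteration fails only when at least one partition failed and none succeeded.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class ClusterIterationSketch {

  /** Hypothetical per-consumer outcome: partition measurements that succeeded and failed. */
  static final class ConsumerResult {
    final int succeeded;
    final int failed;

    ConsumerResult(int succeeded, int failed) {
      this.succeeded = succeeded;
      this.failed = failed;
    }
  }

  /**
   * Returns the cluster name once every consumer has completed, unless every attempted
   * partition measurement failed, in which case the returned future completes exceptionally.
   */
  static CompletableFuture<String> measureCluster(
      String clusterName, List<CompletableFuture<ConsumerResult>> consumers) {
    CompletableFuture<Void> all =
        CompletableFuture.allOf(consumers.toArray(new CompletableFuture[0]));
    return all.thenApply(ignored -> {
      int succeeded = 0;
      int failed = 0;
      for (CompletableFuture<ConsumerResult> consumer : consumers) {
        ConsumerResult result = consumer.join(); // already complete, does not block
        succeeded += result.succeeded;
        failed += result.failed;
      }
      // Partial failures are still tolerated; only a total failure fails the iteration.
      if (failed > 0 && succeeded == 0) {
        throw new IllegalStateException(
            "All " + failed + " partition measurements failed for cluster " + clusterName);
      }
      return clusterName;
    });
  }

  public static void main(String[] args) throws Exception {
    // Every partition of every consumer failed, e.g. because of an SSL misconfiguration.
    List<CompletableFuture<ConsumerResult>> allFailed = Arrays.asList(
        CompletableFuture.completedFuture(new ConsumerResult(0, 4)),
        CompletableFuture.completedFuture(new ConsumerResult(0, 2)));
    try {
      measureCluster("example-cluster", allFailed).get();
    } catch (ExecutionException e) {
      System.out.println("cluster iteration failed: " + e.getCause().getMessage());
    }
  }
}
```

With this shape, a cluster whose consumers all fail to measure any partition would surface as a failed Future.get(), so the last-success timestamp would stop advancing, while a cluster with only some failing partitions keeps the current lenient behavior.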