Yolean/kubernetes-kafka

Kafka Cruise Control self healing failing

wmorgan6796 opened this issue · 9 comments

Hi,

I've been running Kafka for a few months now and have just transitioned to using this repo for my kubernetes Kafka. (Great job btw) I'm just wondering if there has been any reported issues with the Cruise Control add-on. I'm seeing pretty consistently that the Self-healing is failing for some reason I can't figure out and the model stops training itself after it reaches 20%. Any help would be appreciated.

Thanks

@double16 Do you know what to look for in a case like this? I have no experience of Cruise Control in production.

Here is a sample log regarding the self-healing:

[2019-03-08 19:04:54,045] WARN Self-healing has been triggered. (com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier)
[2019-03-08 19:04:54,045] DEBUG Received notification result {FIX,-1} (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector)
[2019-03-08 19:04:54,060] INFO Fixing anomaly {
Metric Anomaly windows: [1552071900000] description: Metric value 7620.000 of BROKER_FOLLOWER_FETCH_LOCAL_TIME_MS_MAX for brokerId=1,host=172.20.114.248 in window 2019-03-08_07:05:00 is out of the normal range for percentile: [10.00, 90.00] (value: [392.400, 2943.000] with margins (lower: 0.200, upper: 0.500)) in 100 history windows from 2019-03-08_07:00:00 to 2019-03-08_10:45:00.
} (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector)
[2019-03-08 19:04:54,060] TRACE Fix the cluster by removing the leadership from the broker: brokerId=1,host=172.20.114.248 (com.linkedin.kafka.cruisecontrol.detector.KafkaMetricAnomaly)
[2019-03-08 19:04:54,060] INFO Self-healing failed to start. (com.linkedin.kafka.cruisecontrol.detector.AnomalyDetector)

Any update on this?

Sorry, been tied up with things. There's not much in the logs you provided. Is there anything else?

Upon further inspection, I found that the metrics anomaly that Cruise Control wasn't able to fix was due to code that hadn't been written yet, namely this

public boolean fix() throws KafkaCruiseControlException {
    // TODO: Fix the cluster by removing the leadership from the brokers with metric anomaly (See PR#175: demote_broker).
    LOG.trace("Fix the cluster by removing the leadership from the broker: {}", _brokerEntity);
    return false;
  }

The thing I haven't been able to figure out is to why the model stops training at 20%. I have tried reaching out through the Cruise Control gitter, but no one was responsive. I've been continuing to test with different settings to figure out how to best set up cruise control.

I'm not sure about the 20% myself. I've seen self-healing in action. If my memory serves me, I've never seen training go beyond 20%.

I guess 20% is about as trained as we can get in the IT business :)

Hmmmm, I can't seem to figure out how to get the self-healing to work. Might just create a small micro-service that listens to the stateful set controller and starts the cruise control re-balance from there.

Having talked with some of the developers on the Cruise Control Gitter chatroom, apparently the 20% training is for the future when they will change the cruise control alogrithm from a hueristic model to a linear regression. Also figured out that part of the issue I was having was due to the fact that metric anomalies healing feature hasn't been completed as of yet. Hence why the self-healing wasn't working for me.