banzaicloud/koperator

Cruise Control's remove_broker task completedWithError handling in Koperator

bartam1 opened this issue · 1 comments

Is your feature request related to a problem? Please describe.

When a Cruise Control task ends with a completedWithError result, Koperator treats it as a successful operation.

case kafkav1beta1.CruiseControlTaskCompleted, kafkav1beta1.CruiseControlTaskCompletedWithError:
When a downscale succeeds, Koperator removes the broker pod, PVC, and broker ConfigMap from the Kubernetes cluster. Treating an errored task as a success can therefore lead to problems such as data loss or missing replicas. A Cruise Control task can end with a completedWithError result in two cases.

  • Case (1) When the CC task cannot be executed at all (HTTP response code 500).
    e.g.:
    < HTTP/1.1 500 Internal Server Error {"errorMessage":"Error processing GET request \u0027/user_tasks\u0027 due to: \u0027There are already 1 active user tasks, which has reached the servlet capacity.\u0027.","version":1,"stackTrace":"java.lang.RuntimeException: There are already 1 active user tasks, which has reached the servlet capacity.\n\tat com.linkedin.kafka.cruisecontrol.servlet.UserTaskManager.ins
    OR
    < HTTP/1.1 500 Internal Server Error {"version":1,"stackTrace":"java.util.concurrent.ExecutionException: com.linkedin.kafka.cruisecontrol.exception.OptimizationFailureException: [RackAwareDistributionGoal] Cannot remove Replica[isLeader\u003dfalse,rack\u003dkafka-0.kafka.svc.cluster.local,broker\u003d0,TopicPartition\u003dexample-topic-2,origBroker\u003d0,isOriginalOffline\u003dtrue,isCurrentOffline\u003dtrue] from DEAD broker 0 (has 6 replicas). Add at least 1 broker. Add at least 1 broker.\n\tat java.base/java.util.concurrent.CompletableFuture.reportGet(Unknown Source)\n\tat java.base/java.util.concurrent.CompletableFuture
    OR
    < HTTP/1.1 500 Internal Server Error "errorMessage":"Error processing POST request \u0027/remove_broker\u0027 due to: \u0027com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: java.lang.IllegalArgumentException: Broker 1 does not exist

  • Case (2) When the task has started executing but is later interrupted unexpectedly (HTTP response code 200). In this case, there is no error message.
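
The two cases above could be told apart roughly as follows. This is a minimal sketch, not Koperator's implementation: the `DownscaleOutcome` type, its constants, and the `classify` function are hypothetical names introduced for illustration, and the task status strings are assumed to be the ones Cruise Control reports for user tasks.

```go
package main

import (
	"fmt"
	"net/http"
)

// DownscaleOutcome is a hypothetical classification used only in this sketch.
type DownscaleOutcome string

const (
	OutcomeInProgress         DownscaleOutcome = "InProgress"
	OutcomeSucceeded          DownscaleOutcome = "Succeeded"
	OutcomeExecutionError     DownscaleOutcome = "ExecutionError"     // case (1)
	OutcomeCompletedWithError DownscaleOutcome = "CompletedWithError" // case (2)
)

// classify maps the HTTP status code of the Cruise Control response and the
// reported task status to an outcome. The status strings are assumptions.
func classify(httpStatus int, taskStatus string) DownscaleOutcome {
	// Case (1): the task could not be executed at all.
	if httpStatus == http.StatusInternalServerError {
		return OutcomeExecutionError
	}
	switch taskStatus {
	case "Completed":
		return OutcomeSucceeded
	case "CompletedWithError":
		// Case (2): the task ran but was interrupted; no error message is available.
		return OutcomeCompletedWithError
	default:
		return OutcomeInProgress
	}
}

func main() {
	fmt.Println(classify(500, ""))
	fmt.Println(classify(200, "CompletedWithError"))
	fmt.Println(classify(200, "Completed"))
}
```

The point of the sketch is that only the `Completed` status leads to a success outcome, so the broker pod, PVC, and ConfigMap would not be removed in either error case.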

Describe the solution you'd like to see

New cruiseControlStates:

  • Case (1) When a task cannot be started. - GracefulDownscaleExecutionError
  • Case (2) When a task has started executing and is later interrupted. - GracefulDownscaleCompletedWithError

Error message at gracefulActionState:

  • Case (1) The error message should be the "errorMessage" from the HTTP response or, when that is not available, the stack trace.
  • Case (2) The error message should be "there are remained partitions: X replica and Y leader"
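
A minimal sketch of how these two messages could be produced. The `ccErrorResponse` struct and both function names are hypothetical; the JSON field names `errorMessage` and `stackTrace` are taken from the example responses shown earlier in this issue.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ccErrorResponse mirrors the fields seen in the HTTP 500 examples above.
type ccErrorResponse struct {
	ErrorMessage string `json:"errorMessage"`
	StackTrace   string `json:"stackTrace"`
}

// executionErrorMessage returns the "errorMessage" field when present and
// falls back to the stack trace. If the body is not valid JSON, the raw body
// is returned (an assumption made for this sketch).
func executionErrorMessage(body []byte) string {
	var resp ccErrorResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return string(body)
	}
	if resp.ErrorMessage != "" {
		return resp.ErrorMessage
	}
	return resp.StackTrace
}

// remainingPartitionsMessage builds the case (2) message from the number of
// replicas and leaders still hosted on the broker.
func remainingPartitionsMessage(replicas, leaders int) string {
	return fmt.Sprintf("there are remained partitions: %d replica and %d leader", replicas, leaders)
}

func main() {
	body := []byte(`{"version":1,"stackTrace":"java.util.concurrent.ExecutionException"}`)
	fmt.Println(executionErrorMessage(body))
	fmt.Println(remainingPartitionsMessage(6, 2))
}
```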

Configuration:

kafkacluster.spec.cruisecontrolconfig.cruiseControlTaskSpec.downscaleFailurePolicy

downscaleFailurePolicy:

  • default - policy
  • perBroker - map["brokerID"]policy

policy:

  • ignore: Treat the task as completed, so the broker, PVC, and ConfigMap will be removed
  • freeze: Stop retrying until the policy is changed to "retry" or "ignore" (in the meantime the user can take manual steps to solve the problem and later change the policy back to retry or ignore)
  • retry: Retry the task an unlimited number of times, every X seconds

To specify the duration between the retries:

  • kafkacluster.spec.cruisecontrolconfig.cruiseControlTaskSpec.failedRetryDurationSeconds - int (default 30)
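
The proposed configuration could map to Go types along these lines. This is only a sketch of the proposal: the type names, the `PolicyFor` and `RetryInterval` helpers, and the fallback to "retry" when no default is set are assumptions, not the existing Koperator API.

```go
package main

import (
	"fmt"
	"time"
)

// Policy is the proposed per-task failure policy.
type Policy string

const (
	PolicyIgnore Policy = "ignore"
	PolicyFreeze Policy = "freeze"
	PolicyRetry  Policy = "retry"
)

// DownscaleFailurePolicy holds a default policy and per-broker overrides
// keyed by broker ID.
type DownscaleFailurePolicy struct {
	Default   Policy            `json:"default,omitempty"`
	PerBroker map[string]Policy `json:"perBroker,omitempty"`
}

// CruiseControlTaskSpec shows only the fields proposed in this issue.
type CruiseControlTaskSpec struct {
	DownscaleFailurePolicy     DownscaleFailurePolicy `json:"downscaleFailurePolicy,omitempty"`
	FailedRetryDurationSeconds int                    `json:"failedRetryDurationSeconds,omitempty"`
}

// PolicyFor resolves the policy for a broker: per-broker override first,
// then the default. Falling back to "retry" is an assumption of this sketch.
func (s CruiseControlTaskSpec) PolicyFor(brokerID string) Policy {
	if p, ok := s.DownscaleFailurePolicy.PerBroker[brokerID]; ok {
		return p
	}
	if s.DownscaleFailurePolicy.Default != "" {
		return s.DownscaleFailurePolicy.Default
	}
	return PolicyRetry
}

// RetryInterval returns the configured retry interval, defaulting to 30 seconds.
func (s CruiseControlTaskSpec) RetryInterval() time.Duration {
	if s.FailedRetryDurationSeconds <= 0 {
		return 30 * time.Second
	}
	return time.Duration(s.FailedRetryDurationSeconds) * time.Second
}

func main() {
	spec := CruiseControlTaskSpec{
		DownscaleFailurePolicy: DownscaleFailurePolicy{
			Default:   PolicyFreeze,
			PerBroker: map[string]Policy{"1": PolicyRetry},
		},
	}
	fmt.Println(spec.PolicyFor("0"), spec.PolicyFor("1"), spec.RetryInterval())
}
```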

The actual policy should be shown in the status:

  • status.brokerState.gracefulActionState.downscaleFailurePolicy - policy

Error handling:

  • In both cases, the broker should not be removed from the status or from the Kubernetes cluster until the downscale operation has succeeded, nor while partitions remain on the broker. A broker that is not available should not be removed either.
  • When a new broker is added with the same ID as a removed broker whose downscale errored, the cruiseControlState should be changed to GracefulUpscaleRequired so that an upscale operation runs on that broker. This gives the user more room to maneuver, because an errored downscale can be reverted.
  • When the status contains a broker with an errored downscale task, the CC controller reconciles every 30 seconds (or at the configured interval) and retries the errored downscale task.
  • An errored downscale task should be considered retryable in every case. I found that when a removed broker is still in the CC kafka-cluster state but is not available (e.g. missing pod), it can still be removed. In this case CC recreates the replicas and leaders that were on the missing broker from the other available brokers and places those partitions on other brokers. When the missing broker comes back, those partitions are deleted from it. When the missing broker is no longer in the CC kafka-cluster state (it has been unavailable for 15 minutes, or whatever is set in the settings), there will be an error message with HTTP code 500.
  • When there is an errored downscale task, other downscale operations should be permitted if the policy for the errored task is set to "freeze".
  • When multiple brokers are to be removed and the first downscale task gets an error, there is a good chance that the others will also end with an error. When a downscale ends with an error, downscale should not be started on the other removed brokers until the errored downscale task has succeeded.
  • The remove broker operation should be used with the exclude_recently_removed_brokers and exclude_recently_demoted_brokers flags. This prevents partitions from being moved to recently removed brokers (CC considers a completedWithError result as the broker having been removed).
  • Upscale operations can run when there is an errored downscale task. Upscale has a higher priority than downscale.
  • When there are multiple errored tasks, the freeze policy can be used to change the order of task execution. E.g. brokers 0, 1, and 2 got a downscale error and their policy is set to freeze; when the policy of broker 1 is changed back to retry, the downscale operation starts on broker 1.
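
The ordering rules for downscale tasks described above could be sketched as follows. The `brokerTask` type and the `nextDownscale` function are hypothetical and cover only the downscale ordering (errored tasks with "retry" block other downscales; "freeze" and "ignore" do not); upscale priority and retry timing are left out.

```go
package main

import "fmt"

// Policy is the proposed per-task failure policy.
type Policy string

const (
	PolicyIgnore Policy = "ignore"
	PolicyFreeze Policy = "freeze"
	PolicyRetry  Policy = "retry"
)

// brokerTask describes a broker that is waiting to be downscaled.
type brokerTask struct {
	BrokerID string
	Errored  bool   // the last downscale attempt ended with an error
	Policy   Policy // effective failure policy for this broker
}

// nextDownscale picks the broker whose downscale should run next.
// An errored task with the "retry" policy is retried first and blocks other
// downscales until it succeeds. Errored tasks with "freeze" (or "ignore",
// which is treated as completed) are skipped, so fresh tasks may proceed.
func nextDownscale(tasks []brokerTask) (string, bool) {
	for _, t := range tasks {
		if t.Errored && t.Policy == PolicyRetry {
			return t.BrokerID, true
		}
	}
	for _, t := range tasks {
		if !t.Errored {
			return t.BrokerID, true
		}
	}
	return "", false
}

func main() {
	// Brokers 0, 1, and 2 errored and were frozen; broker 1 was set back to retry.
	tasks := []brokerTask{
		{BrokerID: "0", Errored: true, Policy: PolicyFreeze},
		{BrokerID: "1", Errored: true, Policy: PolicyRetry},
		{BrokerID: "2", Errored: true, Policy: PolicyFreeze},
	}
	id, ok := nextDownscale(tasks)
	fmt.Println(id, ok)
}
```

With the example from the last bullet, the sketch selects broker 1, since it is the only errored task whose policy allows a retry.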

This has been implemented in the latest version.