banzaicloud/koperator

Control upgrade koperator

metbog opened this issue · 6 comments

Describe the solution you'd like to see
Control rolling restarts during the upgrade

Additional context
If we have a very large Kafka cluster, or multiple Kafka clusters in one k8s cluster, a rolling restart might be disruptive or inconvenient. We need more control over when the pods belonging to a particular deployment are restarted.

Can you provide more details on what kind of control over pod restarts you are thinking of?

for example:

  1. I want to be sure that the broker is not a leader for any partition before the restart.
  2. I don't want to update a cluster automatically where there can be a topic with only 1 replica. In that case we should get an error.
  3. When I have multiple Kafka clusters on a k8s cluster, I don't want the restart to impact storage.

I think there should be a way to add a label or annotation to a broker's pod whose restart I want to postpone or control manually (so I can run a precheck on my side).
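Points 1 and 2 above could be implemented as a precheck over cluster metadata before letting the operator restart a pod. The sketch below is purely illustrative (it is not koperator code); the metadata shape, mapping topic to partition to leader/replica info, is an assumption that roughly mirrors what an admin client's topic-metadata query returns.

```python
# Hypothetical precheck sketch: decide whether a broker is safe to restart.
# `metadata` maps topic -> partition id -> {"leader": broker_id, "replicas": [ids]}.

def restart_blockers(metadata, broker_id):
    """Return a list of reasons why restarting `broker_id` is unsafe."""
    blockers = []
    for topic, partitions in metadata.items():
        for pid, info in partitions.items():
            # Point 1: the broker must not lead any partition.
            if info["leader"] == broker_id:
                blockers.append(f"{topic}-{pid}: broker {broker_id} is leader")
            # Point 2: a single-replica partition on this broker would go offline.
            if len(info["replicas"]) == 1 and broker_id in info["replicas"]:
                blockers.append(f"{topic}-{pid}: only replica is broker {broker_id}")
    return blockers

metadata = {
    "orders": {0: {"leader": 1, "replicas": [1, 2]},
               1: {"leader": 2, "replicas": [2, 1]}},
    "audit":  {0: {"leader": 3, "replicas": [3]}},  # replication factor 1
}

print(restart_blockers(metadata, 1))  # broker 1 still leads orders-0
print(restart_blockers(metadata, 3))  # broker 3 leads audit-0 and is its only replica
```

An empty result would mean the pod can be restarted; a non-empty one would surface the error the reporter asks for in point 2.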

ienns commented

Hi,

The scenario is the following: we upgrade the koperator deployment (probably with a CRD object update), the new version requires restarting all brokers, and it does so automatically right when the operator pod starts.
It would be acceptable if the restart triggered the demote-broker process and waited until all leaders moved to spare brokers.
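The "wait until all leaders move" step could be sketched as a simple polling loop. This is a hypothetical illustration of the idea, not koperator behavior; `fetch_leaders` is an assumed callback standing in for an admin-client metadata query.

```python
import time

def wait_until_drained(broker_id, fetch_leaders, timeout_s=300, poll_s=1.0):
    """Poll until no partition reports `broker_id` as its leader.

    `fetch_leaders()` is a hypothetical callback returning a list of
    (topic, partition, leader_broker_id) tuples for the whole cluster.
    Returns True once the broker leads nothing, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        led = [(t, p) for t, p, leader in fetch_leaders() if leader == broker_id]
        if not led:
            return True  # safe to restart: broker leads no partitions
        time.sleep(poll_s)
    return False  # timed out; leadership did not move away

# Simulated metadata: leadership of orders-0 moves off broker 1 on the third poll.
states = iter([
    [("orders", 0, 1), ("orders", 1, 2)],
    [("orders", 0, 1), ("orders", 1, 2)],
    [("orders", 0, 2), ("orders", 1, 2)],
])
print(wait_until_drained(1, lambda: next(states), timeout_s=5, poll_s=0.01))  # True
```

A real implementation would run this between the demote call and the pod deletion, and fail the reconcile loop on timeout.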

@metbog

I don't want to update a cluster automatically where there can be a topic with only 1 replica. In that case we should get an error.

Can you be more specific about what kind of updates you mean?

When I have multiple Kafka clusters on a k8s cluster, I don't want the restart to impact storage.

Could you elaborate more on this?

It would be acceptable if the restart triggered the demote-broker process and waited until all leaders moved to spare brokers.

Most of the time a broker restart completes quickly, so client applications connected to that broker experience connection issues only for a short period. This short period is not an issue, as the Kafka client library reconnects automatically.

In case a broker restart takes a long time for whatever reason, the controller broker will consider this broker failed and will change the leader for all partitions that reside on the failed broker, so essentially all leaderships are moved to healthy in-sync replicas (ISRs).
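As a simplified illustration of that failover (the actual controller logic is more involved, e.g. unclean-leader-election settings), the new leader is the first replica in assignment order that is still in the ISR and is not the failed broker:

```python
def elect_leader(replicas, isr, failed_broker):
    """Simplified leader election after `failed_broker` fails: pick the
    first replica (in assignment order) that is still in the ISR.
    Returns None if no eligible replica remains, i.e. the partition goes
    offline unless unclean leader election is enabled."""
    for broker in replicas:
        if broker != failed_broker and broker in isr:
            return broker
    return None

print(elect_leader(replicas=[1, 2, 3], isr=[1, 2, 3], failed_broker=1))  # 2
print(elect_leader(replicas=[3], isr=[3], failed_broker=3))              # None
```

The second call shows why a replication factor of 1 (point 2 in the original request) is dangerous: there is no healthy ISR left to take over.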

@ienns can you expand on why you think a demote operation should be triggered for a broker before the pod restart? What benefit does it provide over the behavior described above? Also, with what parameters would the demote operation be invoked (assuming you referred to the demote-broker operation provided by Cruise Control)?
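For context, the Cruise Control operation referred to here is its `demote_broker` REST endpoint. The sketch below only builds such a request URL; the base URL is hypothetical, the request would be sent as a POST, and the full parameter set varies by Cruise Control version, so check the docs for yours.

```python
from urllib.parse import urlencode

def demote_broker_url(base_url, broker_id, dryrun=True):
    """Build a Cruise Control demote_broker request URL (sent via POST).

    The endpoint path and the `brokerid`/`dryrun` parameters follow the
    Cruise Control REST API; other parameters are version-dependent.
    """
    query = urlencode({"brokerid": broker_id, "dryrun": str(dryrun).lower()})
    return f"{base_url}/kafkacruisecontrol/demote_broker?{query}"

# `cruise-control:9090` is a hypothetical in-cluster service address.
print(demote_broker_url("http://cruise-control:9090", 1))
# http://cruise-control:9090/kafkacruisecontrol/demote_broker?brokerid=1&dryrun=true
```

Demoting moves all partition leadership off the broker before the restart, which is what the reporter's "wait till all leaders move to spare brokers" step amounts to.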

closing stale ticket