tarantool/tarantool-operator

Leader election process should not rely on pod ready status

R-omk opened this issue · 5 comments

R-omk commented

func (r *LeaderElection) CanBeLeader(cluster api.Cluster, pod *v1.Pod) bool {
    if pod.GetDeletionTimestamp() != nil {
        return false
    }
    if !utils.IsPodRunning(pod) {
        return false
    }
    if !cluster.IsBootstrapped() {
        return utils.IsPodDefaultContainerReady(pod)
    }
    return utils.IsPodReady(pod)
}

Expected behavior

Any pod in which cartridge is functioning and has a non-empty config should be eligible to be elected as the leader (when the cluster is already bootstrapped).

The pod Ready condition may depend on a readiness probe, which can differ considerably between clusters or even between replicasets.
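
A minimal sketch of the requested check (not the operator's actual code): after bootstrap it asks the instance itself whether it is configured instead of trusting the pod Ready condition. The helper isCartridgeConfigured is hypothetical; it stands in for whatever mechanism is used to read the instance's cartridge state.

func (r *LeaderElection) CanBeLeader(cluster api.Cluster, pod *v1.Pod) bool {
    if pod.GetDeletionTimestamp() != nil {
        return false
    }
    if !utils.IsPodRunning(pod) {
        return false
    }
    if !cluster.IsBootstrapped() {
        // No config exists yet, so a running default container is the best signal available.
        return utils.IsPodDefaultContainerReady(pod)
    }
    // After bootstrap, require a functioning cartridge with a non-empty config
    // rather than the pod Ready condition, which depends on an arbitrary readiness probe.
    return isCartridgeConfigured(pod) // hypothetical helper, not part of the operator
}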

Actual behavior

Two scenarios are possible:

  • a deadlock: the operator cannot choose a leader to apply an important config, and the pods cannot become ready without a config.
  • the operator relies only on the Ready status and elects as the leader a pod that has lost its config data. As a result, the config either does not reach the entire cluster or partially overwrites the current one.

Environment

  • Tarantool Operator 1.0.0-rc1

related:
#116

For some reason the ticket was closed, but the problem remains unfixed: nodes are still added to the topology one by one, not all together as happens in the cartridge CLI.

The subject is much deeper than it seems at first glance. We can't add all instances to the topology at once, because we don't have a resource that knows about all instances: each Role knows only about its own instances, and the Cluster knows nothing about instances. So we just add each instance when it becomes ready.

I agree that a custom readiness probe can break everything and we should think about it.
But I disagree with the second "Actual behavior" point, because the cartridge config has an exclusive two-phase lock, so a pod that has lost its config data can't take it. Also, different Roles can't select different leaders, because the leader entry is part of ClusterStatus: if one of the roles tries to find a leader during reconciliation when a leader has already been selected by another role, it will fail here, which should save us from split-brain. So I think that "deadlock" is the worst behavior in this implementation.
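
For reference, a sketch of that guard (not the operator's actual code), assuming the elected leader is recorded in the cluster status; the type clusterCR and the field Status.Leader are placeholders, and c is a controller-runtime client.Client:

func ensureLeader(ctx context.Context, c client.Client, cluster *clusterCR, candidate string) (string, error) {
    if cluster.Status.Leader != "" {
        // A leader was already elected by another role's reconciliation; reuse it.
        return cluster.Status.Leader, nil
    }
    cluster.Status.Leader = candidate
    // If a concurrent reconciliation recorded a leader first, this update fails
    // with a resourceVersion conflict and the reconcile is requeued, which is
    // what prevents two roles from ending up with different leaders.
    if err := c.Status().Update(ctx, cluster); err != nil {
        return "", err
    }
    return candidate, nil
}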

Anyway, we need to fix the case with custom readiness probes, and we will discuss it within the Tarantool team.

R-omk commented

because the cartridge config has an exclusive two-phase lock, so a pod that has lost its config data can't take it.

This is a misconception: when a node does not have a config, it does not have a topology either, so only the nodes that are being applied to the topology right now will participate in the two-phase commit.
This means that when adding nodes one by one, not all possible hosts will participate in the two-phase commit.

And then come the fatal consequences: the host on which the topology is applied starts sending its empty config to everyone it knows about, which leads to the entire cluster removing all keys from the config (except the topology).

In our fork of the old version, we only select a pod as leader when cartridge is in the configured state.

The unconfigured state is allowed only during the initial bootstrap, and only provided that absolutely all pods in the cluster are available (cartridge has already started, but the pod may not have Ready status) and are also in the unconfigured state.
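
A rough sketch of that policy (our fork's code is not shown here); cartridgeState is a hypothetical helper that reports the instance's cartridge state, and the "RolesConfigured" / "Unconfigured" names follow cartridge's confapplier states but may need adjusting:

func canBeLeader(cluster api.Cluster, pod *v1.Pod, allPods []*v1.Pod) bool {
    state, err := cartridgeState(pod) // hypothetical helper
    if err != nil {
        return false // instance unreachable, never a leader candidate
    }
    if state == "RolesConfigured" {
        // A configured instance holds a non-empty config and is safe to lead.
        return true
    }
    if cluster.IsBootstrapped() || state != "Unconfigured" {
        return false
    }
    // Initial bootstrap only: every pod must be reachable (cartridge started,
    // even if the pod does not have Ready status) and still unconfigured.
    for _, p := range allPods {
        if s, err := cartridgeState(p); err != nil || s != "Unconfigured" {
            return false
        }
    }
    return true
}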


An example of the steps for the old version; I admit that this can happen in the new version too.

For example, there are 12 hosts in the cluster (SWIM broadcast disabled):

  • remove the PVC and stop pod 12
  • the StatefulSet immediately starts a new pod 12 and creates a fresh volume
  • suppose the new pod is chosen as the leader
  • pod 12 adds itself to the topology; OK, we now have a cluster of size one without a config
  • delete pod 5; the StatefulSet immediately starts a new one, the config is still present
  • pod 12 adds pod 5 to the topology... pod 5 loses its config and we end up with a cluster of size 2

each Role knows only about self instances. Cluster knows nothing about instances.

The new version has a somewhat different architecture; I will analyze it in more detail later.

In our fork of the old version, we only select a pod as leader when cartridge is in the configured state.

It looks like we did something very similar in our EE version; I will talk with management about moving it into CE.

We discussed the problem within our team; this situation is indeed possible. We have a solution to fix this behavior, but instances will still be added one by one. We'll try to release a fix as soon as possible.

Not fixed, at least on k8s 1.21.