ClusterLabs/booth

Ticket not revoked when starting without majority

seabres opened this issue · 3 comments

Scenario:

  • Node1 holds the granted ticket and has a majority.
  • Node1 is disconnected from the network.
  • booth is restarted before the ticket expires.
  • booth runs endless elections and the granted ticket never expires.
    Tested today with Git HEAD.

Such a booth restart can happen when booth runs as a cluster resource and is migrated between nodes inside the cluster.

In the described case, the granted ticket never expires (until booth sees the other booth instances again), even though the node does not have a majority.
This is not what booth is meant to do: it should guarantee that a ticket is granted only once at a time (allowing for expiry times).

The cause of this behavior is the struct member in_election.
During an election the CIB ticket is not updated, but elections_end immediately starts a new election when nobody has won. Therefore, in_election is always 1 in ticket_cron.
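
To make the loop easier to follow, here is a small self-contained model of it. This is only a sketch and not booth's actual code: the struct, the `_model` function names, and the values are hypothetical, and the only things taken from the report are that the expiry time is zeroed when an election starts, that a failed election immediately starts another one, and that the expiry check is skipped while in_election is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical, heavily simplified ticket; not booth's real struct. */
struct ticket_model {
    bool granted;        /* what the CIB currently shows */
    bool in_election;    /* set while an election is running */
    time_t term_expires; /* 0 means "no expiry recorded" */
};

/* Starting an election clears the expiry time, as the report describes. */
static void new_election_model(struct ticket_model *tk)
{
    assert(tk != NULL);
    tk->in_election = true;
    tk->term_expires = 0;
}

/* Nobody won: start over at once, so in_election never drops to 0. */
static void elections_end_model(struct ticket_model *tk)
{
    new_election_model(tk);
}

/* Periodic check: expiry handling is skipped during an election. */
static void ticket_cron_model(struct ticket_model *tk, time_t now)
{
    assert(tk != NULL);
    if (!tk->in_election && tk->term_expires != 0 && now >= tk->term_expires)
        tk->granted = false; /* this is where the ticket would be revoked */
}

/* Run `rounds` failed elections on an isolated node, advancing the
 * clock by `step` seconds per round. Returns whether the ticket is
 * still shown as granted afterwards. */
static bool simulate_isolated_node(int rounds, time_t step)
{
    struct ticket_model tk = {
        .granted = true, .in_election = false, .term_expires = 100
    };
    time_t now = 0;

    new_election_model(&tk); /* booth restarts and starts an election */
    for (int i = 0; i < rounds; i++) {
        now += step;
        ticket_cron_model(&tk, now);
        elections_end_model(&tk);
    }
    return tk.granted;
}
```

In this model, simulate_isolated_node(1000, 10) runs far past the original expiry time of 100 and the ticket is still reported as granted, which matches the observed behavior.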

It might be better to let the ticket expire normally (don't set the expiry time to zero) and revoke it locally.
If the booth instance holding the granted ticket cannot win an election before the ticket's expiry time, the ticket shall expire. Some other instance that has a majority will then take over, and the ticket exists only once.

Rainer

Good catch!

I've never been entirely happy with the in_election flag, which for the most part served to mark the ticket as not (yet) valid. Perhaps we need an extra copy of the ticket which is up for election and to keep the existing ticket intact so that it can be managed and expired properly.

That would be overkill. A ticket not being valid does not mean that we are not allowed to update the CIB.
One possible solution is to stop clearing term_expires in new_election.
Then we need a construct in ticket_cron similar to the last if statement, which calls ticket_loss, for the case where in_election is not zero.
As far as I understand ticket_loss, this is exactly what is needed in that case. It records that the ticket is lost and updates the CIB unconditionally, so the CIB then reflects the state booth has in memory.
If I am not mistaken, removing the !tk->in_election from the last if statement in ticket_cron is then enough.
The outcome of these two changes would be that the ticket expires normally and, on expiry, the state changes to FOLLOWER, the ticket has no owner, and the CIB is updated with this information.
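
As a rough illustration of these two changes, here is a self-contained model. It is a sketch under assumptions, not a patch against booth: the struct, the enum, and every `_fixed`/`_model` name are hypothetical, and only the two changes themselves (keep term_expires in new_election, drop the in_election test before calling ticket_loss) come from the discussion above. Resetting term_expires inside the loss handler is an extra assumption of the model, there only so the loss fires once.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical simplified state; booth's real ticket has more fields. */
enum state_model { ST_LEADER_MODEL, ST_FOLLOWER_MODEL };

struct ticket_model {
    enum state_model state;
    int owner;           /* -1 means "no owner" */
    bool cib_granted;    /* what the CIB currently shows */
    bool in_election;
    time_t term_expires; /* 0 means "no expiry recorded" */
};

/* Change 1: starting an election no longer clears term_expires. */
static void new_election_fixed(struct ticket_model *tk)
{
    assert(tk != NULL);
    tk->in_election = true;
}

/* Model of the loss handling: record the loss, update the CIB. */
static void ticket_loss_model(struct ticket_model *tk)
{
    tk->state = ST_FOLLOWER_MODEL;
    tk->owner = -1;
    tk->cib_granted = false;
    tk->term_expires = 0; /* model-only assumption: fire the loss once */
}

/* Change 2: the expiry check no longer tests !in_election. */
static void ticket_cron_fixed(struct ticket_model *tk, time_t now)
{
    assert(tk != NULL);
    if (tk->term_expires != 0 && now >= tk->term_expires)
        ticket_loss_model(tk);
}

/* Run `rounds` failed elections on an isolated node, advancing the
 * clock by `step` seconds per round, and return the final ticket. */
static struct ticket_model simulate_fixed(int rounds, time_t step)
{
    struct ticket_model tk = {
        .state = ST_LEADER_MODEL, .owner = 1, .cib_granted = true,
        .in_election = false, .term_expires = 100
    };
    time_t now = 0;

    new_election_fixed(&tk);
    for (int i = 0; i < rounds; i++) {
        now += step;
        ticket_cron_fixed(&tk, now);
        new_election_fixed(&tk); /* nobody won, so elect again */
    }
    return tk;
}
```

With these values the ticket stays granted while the clock is below 100 and is dropped in the round where it reaches 100, even though in_election stays set the whole time.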

OK. I think that we have this fixed now. Please test. Thanks for reporting and analysis!