Add Sequence Numbers to write operations

Question

bleskes opened this issue 9 years ago · 16 comments

Answer 1 · 2015-05-05T17:53:50.000Z

First use case - faster replica recovery

I'd argue the first use case is making replication semantics more sound :)

Answer 2 · 2016-03-08T06:53:13.000Z

It's not clear as to what would happen in the following split brain scenario (scenario-1):

1. A split occurs, forming two networks.
2. The network that didn't have a master elects a master (call this network-2).
3. The new master elects a new primary (in network-2).
4. The primary in network-2 now has an incremented term value (say 11). The primary in network-1 continues to have the same term value (10 in this example).
5. The connection between the networks is re-established.

In this case we need a strategy for reconciling the differences in the indexes, if there were change operations in both the networks. Does a strategy like that exist today? So far it seems like this situation is preventable by using min_master_nodes. However, if min_master_nodes is not set appropriately, I would think some default strategy should come into effect.

An example strategy could be:

Keep logs of write operations in both networks for a configurable amount of time. If the networks' connectivity is restored within this time period: (a) Drop all nodes in network-2 to read-only replica status (b) Attempt to reconcile the differences, and use network-1's state if the differences are not reconcilable. (c) Remove read-only status
If the connectivity isn't restored within that time, when connection is restored, all indices in network-2 that have competing primaries in network-1 will lose their shards, and replicas are created from network-1.

Another interesting situation (scenario-2) to consider:

Continuing with the scenario described above until (4) ...
network-1 has another split, creating network-1 and network-1a. Network-1a gives term value of 11 to the new primary in that network.
network-1 completely fails, and connectivity between network-1a and network-2 is restored. Now we may have a scenario where subsequent change operations do not fail outright but still lead to different indexes in the replicas, with some operations failing some of the time, creating a messy situation.

This would happen if there is no reconciliation strategy in effect.

I do see that, in scenario-1, the sequence numbering method will keep shards that have connectivity to both networks in a consistent state. In scenario-2, it is possible that the same shard gets operations with the same term values from multiple primaries, and that again could create a faulty index in that replica.

I am still trying to understand Elasticsearch's cluster behavior. It's possible that I might have made assumptions that aren't correct.

Answer 3 · 2016-03-08T08:05:31.000Z

In this case we need a strategy for reconciling the differences in the indexes, if there were change operations in both the networks. Does a strategy like that exist today?

The current strategy, which seq# will keep enforcing but in an easier and faster way, is that all replicas are "reset" to be an exact copy of the primary currently chosen by the master. As you noted, this falls apart when there are two residing masters in the cluster. Indeed, the only way to prevent this is by setting minimum master nodes, which is the number one most important setting to set in ES (it tells ES what the expected cluster size is).
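To make the role of minimum master nodes concrete, here is a minimal Python sketch. It assumes the usual majority rule (floor(n / 2) + 1 over the master-eligible nodes); the function names are illustrative and are not Elasticsearch APIs.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Majority of the master-eligible nodes: floor(n / 2) + 1."""
    return master_eligible_nodes // 2 + 1


def sides_that_can_elect(partition_sizes, minimum_master_nodes):
    """Return the partition sizes that still see enough master-eligible nodes."""
    return [size for size in partition_sizes if size >= minimum_master_nodes]


# Five master-eligible nodes split 3 / 2, setting configured as a majority.
print(sides_that_can_elect([3, 2], quorum(5)))  # [3]

# Same split with the setting left at 1: both sides can elect a master.
print(sides_that_can_elect([3, 2], 1))  # [3, 2]
```

With a majority setting, at most one side of any partition can hold a master, which is why the split-brain scenario above requires a misconfigured value.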

If min master nodes is not set and a split brain occurs, resolution will come when one of the masters steps down (either by manual intervention or by detecting the other one). In that case all replicas will "reset" to the primary designated by the left over master.
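As a rough illustration of what such a "reset" could look like once operations carry sequence numbers, here is a hedged Python sketch. It is not the actual Elasticsearch recovery code. It assumes a hypothetical `global_checkpoint` below which both copies are known to be identical, so that only the tail above it has to be discarded and copied again from the primary.

```python
def reset_replica(primary_ops, replica_ops, global_checkpoint):
    """Make the replica's operations match the primary's.

    Both arguments map seq# -> operation. Operations at or below
    global_checkpoint are assumed (hypothetically) to be identical on both
    copies, so only operations above it are dropped and replayed.
    """
    kept = {seq: op for seq, op in replica_ops.items() if seq <= global_checkpoint}
    replayed = {seq: op for seq, op in primary_ops.items() if seq > global_checkpoint}
    return {**kept, **replayed}, len(replayed)


primary = {0: "index a", 1: "index b", 2: "index c", 3: "delete a"}
replica = {0: "index a", 1: "index b", 2: "index x"}  # seq# 2 diverged

synced, copied = reset_replica(primary, replica, global_checkpoint=1)
print(synced == primary, copied)  # True 2
```

The point of the sketch is only that the amount of copied data depends on how far the replica's history diverged, not on the size of the shard.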

Drop all nodes in network-2 to read-only replica status

This is similar to what ES does: nodes with no master will only serve read requests and block writes (this is the default; it can also be configured to block reads).
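The behaviour described here can be sketched as a small Python model. The class, the `no_master_block` parameter, and its values "write" and "all" are hypothetical names chosen for this illustration, not a statement of how ES implements the block.

```python
class NoMasterError(Exception):
    """Raised when a request is rejected because the node has no master."""


class Node:
    def __init__(self, has_master, no_master_block="write"):
        # no_master_block (hypothetical): "write" blocks only writes,
        # "all" blocks reads as well.
        self.has_master = has_master
        self.no_master_block = no_master_block
        self.docs = {}

    def read(self, doc_id):
        if not self.has_master and self.no_master_block == "all":
            raise NoMasterError("reads blocked: no master")
        return self.docs.get(doc_id)

    def write(self, doc_id, doc):
        if not self.has_master:
            raise NoMasterError("writes blocked: no master")
        self.docs[doc_id] = doc


node = Node(has_master=True)
node.write("1", {"title": "hello"})

node.has_master = False          # node is partitioned away from its master
print(node.read("1"))            # {'title': 'hello'}
try:
    node.write("2", {"title": "world"})
except NoMasterError as err:
    print(err)                   # writes blocked: no master
```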

it is possible that the same shard gets operations with the same term values from multiple primaries, and that again could create a faulty index in that replica.

If the term is the same from both primaries, the replica will accept them according to the current plan. The situation will be resolved when the network restores and the left over primary and replica sync but indeed there are potential troubles there. I have some ideas on how to fix this specific secondary failure (split brain is the true issue, after which all bets are off) but there are bigger fish to catch first :)
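A small Python sketch of the problem, using the term values from scenario-2. The `Replica` class is illustrative only. It assumes (this rule is not spelled out in the thread) that a replica rejects operations carrying a term older than the highest it has seen and accepts everything else.

```python
class Replica:
    def __init__(self, term):
        self.term = term
        self.ops = {}  # seq# -> (term, operation)

    def apply(self, op_term, seq_no, operation):
        """Accept an operation unless it comes from an older primary term."""
        if op_term < self.term:
            return False
        self.term = op_term
        self.ops[seq_no] = (op_term, operation)
        return True


replica = Replica(term=10)
print(replica.apply(11, 7, "from primary in network-2"))   # True
print(replica.apply(10, 7, "from stale primary"))          # False
print(replica.apply(11, 7, "from primary in network-1a"))  # True
print(replica.ops[7])  # (11, 'from primary in network-1a')
```

The stale term-10 primary is fenced off, but the two term-11 primaries are indistinguishable to the replica, so the second write to seq# 7 silently replaces the first.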

Answer 4 · 2016-03-08T08:50:05.000Z

Thank you very much for your clarification. I rather enjoy all these discussions and your comments.

The current strategy, which seq# will keep enforcing but in an easier and faster way, is that all replicas are "reset" to be an exact copy of the primary currently chosen by the master.

The situation will be resolved when the network restores and the left over primary and replica sync

I would like to clearly understand the reset/sync scenarios. What triggers reset/sync?

I can think of a couple of "normal" operation scenarios

1. I would think that whenever a node joins a network, the master would initiate a sync/reset.
2. If a replica fails for a request, I suppose the primary should keep attempting a sync/reset, otherwise the replica might keep diverging, and at some point the master has to decommission that replica, otherwise the reads would be inconsistent.

In the case of split brain, with multi-network replicas (assuming min master nodes is set), primary-1 has been assuming that this replica R (on a third node, say N-3) has been failing (because of its allegiance to primary-2) but is still in the network. Hence it would attempt a sync/reset. How does this protocol work? Should master-1 attempt to decommission R at some point, going by assumption (2)?

This problem will occur in a loop if R is decommissioned but another replica is installed on N-3 in its place, by the same protocol. There will be contention on N-3 for "reset"-ing replica shards by both the masters.

I suppose one way to resolve this is by letting a node choose a master if there are multiple masters. If we did this, then whenever a node loses its master, it would choose the other master, and there will be a sync/reset and all is well.

However if the node chooses its master, the other master will lose quorum, and hence cease to exist, which is a good resolution for this issue in my opinion.

Answer 5 · 2016-03-08T09:19:28.000Z

The two issues you mention indeed trigger a primary/replica sync. I'm not sure I follow the rest; I would like to ask you to continue the discussion on discuss.elastic.co. We try to keep GitHub for issues and work items. Thx!

Answer 6 · 2016-03-08T09:40:47.000Z

Sure. Posted it here: https://discuss.elastic.co/t/sequence-numbers-to-write-ops-split-brain-scenario/43748

Answer 7 · 2016-04-05T08:19:52.000Z

Any plan to release this?
It seems that after this release you will make ES an AP system? Will you eventually provide config parameters that let users choose whether ES behaves as an AP or a CP system?

Answer 8 · 2016-04-05T09:09:35.000Z

@makeyang this will be released as soon as it is done. There's still a lot of work to do.

It seems that after this release you will make ES an AP system? Will you eventually provide config parameters that let users choose whether ES behaves as an AP or a CP system?

ES is currently and will stay CP in the foreseeable future. If a node is partitioned away from the cluster it will serve read requests (configurable) but will block writes, in which case we drop availability. Of course in future there are many options but currently there are no concrete plans to make it any different.

Answer 9 · 2017-03-13T19:27:56.000Z

In the "Consistency and Replication in Elasticsearch" talk at Elastic{on} Mar 8, 2017 this issue was brought up as possibly providing groundwork for a change API.

@bleskes, @ywelsch, @jasontedor (pardon, I can't remember which of you brought this up): Was that in reference to issue #1242, or was that about a different form or meaning of change API?

Many thanks for your talk!

Answer 10 · 2017-03-13T19:32:27.000Z

@milutz Thank you for attending and your interest!

Yes, the open issue for the changes API is #1242. That issue gives a high-level overview of some possible goals for the changes API but the actual design is yet to be worked out. Sequence numbers will form a basis for what we will eventually build.

Answer 11 · 2017-03-13T19:55:25.000Z

@jasontedor: Awesome! I'm very interested in both of these efforts and wanted to confirm which issues I should subscribe to for notifications. Many thanks, and again many thanks to all of you for your efforts and your awesome talk - it answered many questions I've had about the platform!

Answer 12 · 2017-05-03T19:25:51.000Z

Very curious about this issue and #1242. Are these high-priority issues for Elastic, and is there any sense of a timeline?

Answer 13 · 2017-05-03T22:20:56.000Z

@andrewluetgers We do not provide timelines. I can tell you this:

Sequence IDs and some of the features they enable will ship in 6.0.0; it is one of our highest priorities for that release.
The changes API is to be determined; it will definitely not ship with 6.0.0.

Answer 14 · 2017-05-04T02:02:42.000Z

Will Term be introduced for master election in ES?

Answer 15 · 2017-05-04T02:08:09.000Z

Will Term be introduced for master election in ES?

Yes.

Answer 16 · 2019-02-05T23:38:25.000Z

The work prescribed in this issue is now completed and will be part of the coming 6.7 and 7.0 releases. There are still some small follow-ups we want to do, but they do not need to be tracked as part of this issue. We now consider this completed.
