elastic/elasticsearch

A new cluster coordination layer

ywelsch opened this issue · 2 comments

The cluster state contains important metadata about the cluster, including what the mappings look like, what settings the indices have, which shards are allocated to which nodes, etc. Inconsistencies in the cluster state can have severe consequences, including inconsistent search results and data loss, and the job of the cluster state coordination subsystem is to prevent any such inconsistencies. Ideally this subsystem should also be easy to configure correctly, and it should perform well in a variety of situations.
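To make this concrete, here is a hypothetical sketch (not the real `ClusterState` class, and the field names are illustrative only) of the kind of metadata the cluster state carries, all of which every node must agree on:

```python
# Hypothetical sketch of cluster state contents; the real structure in
# Elasticsearch is a versioned, immutable Java object, not a dict.
cluster_state = {
    "version": 42,                 # monotonically increasing per change
    "master_node": "node-1",       # the currently elected master
    "metadata": {
        "indices": {
            "logs-2018": {
                "settings": {"number_of_shards": 3, "number_of_replicas": 1},
                "mappings": {"properties": {"message": {"type": "text"}}},
            }
        }
    },
    # routing table: which shard copies are allocated to which nodes
    "routing_table": {"logs-2018": {0: ["node-1", "node-2"]}},
}
```

If two nodes disagree on, say, the routing table, searches may return different results depending on which node is queried, which is exactly the class of inconsistency the coordination layer must rule out.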

The goal of this project is to rebuild the cluster state coordination subsystem, making it more reliable, performant and user-friendly. Better reliability will be achieved by basing the core algorithm on strong theoretical underpinnings and extensive testing. Misconfiguration of the minimum_master_nodes setting, one of the most common causes for cluster state inconsistencies, will be addressed by having this property fully managed by the system itself.
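The danger of misconfiguring `minimum_master_nodes` can be shown with simple quorum arithmetic. The following is an illustrative sketch (the function names are invented, not Elasticsearch APIs): the safe value is a strict majority of the master-eligible nodes, and any smaller value allows two disjoint partitions to each elect a master (split-brain).

```python
def majority(master_eligible_nodes: int) -> int:
    """Smallest number of votes that forms a strict majority (quorum)."""
    return master_eligible_nodes // 2 + 1

def can_split_brain(master_eligible_nodes: int, minimum_master_nodes: int) -> bool:
    """True if two disjoint partitions could each satisfy the setting
    simultaneously, i.e. each elect its own master."""
    return 2 * minimum_master_nodes <= master_eligible_nodes
```

For a 3-node cluster, `majority(3)` is 2; leaving the setting at 1 means a network partition into {A} and {B, C} can produce two masters, while setting it to 2 makes that impossible. Having the system manage this value itself removes the failure mode entirely.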

We've built a prototype to validate the approach and, based on our experience with this, present the following development roadmap for this new cluster coordination and consensus layer, targeting ES 7.0:

  • core algorithm: Adds term and voting configuration to cluster state (#32100) and directly implements the transition rules from the spec (#32171)
  • node discovery: builds list of peers based on UnicastHostProviders, establishes permanent connections and identifies active master nodes (#32246, #32642, #32643, #32939, #33012, #36603)
  • cluster state publication: Pipeline to publish a single cluster state to the other nodes in the cluster (#32584)
  • deterministic / unit-testable MasterService (& async Discovery.publish method): increases testability of MasterService and the discovery layer (#32493)
  • election scheduling (#32846) and prevoting (#32847), avoiding duelling elections
  • leader election & lifecycle modes (Candidate, Leader, Follower): introduces the basic lifecycle states that a node can go through (#33013, #33668)
  • node joining (voting and non-voting join), adding joining node to cluster state (#33013)
  • leader & follower failure detector (#33024, #33917, #34049, #34147)
  • deterministic testing of cluster formation (#33668, #33713, #33991, #34002, #34039, #34181, #34241)
  • term bumping, ensuring all followers have voted for the leader (#34335, #34346)
  • node leaving cluster on disconnect or failure (#34503)
  • auto-reconfiguration rules: provides the basis for the cluster to stay highly available by shifting votes from unavailable to available nodes (#33924, #34592, #35217)
  • transport layer: transport actions & mock transport for unit testing (#33713)
  • storage layer: diff-based storage for the cluster state and current term (#33958)
  • cluster state application (& acking): Apply a committed cluster state on each node and acknowledge that it has been applied (#34257, #34315)
  • cluster bootstrap method to inject an initial state + configuration (#34345, #34961, #35847)
  • voting configuration exclusions API, which allows a cluster to be safely scaled down from 2 nodes to 1 (#35446, #36007, #36226)
  • auto-bootstrapping and auto-scaling in integration tests based on bootstrap / voting configuration exclusions API (#34365, #35446, #35488, #35678, #35724)
  • diff-based cluster state publishing (#35290, #35684)
  • lag detection: remove nodes from cluster when they fall too far behind the master (#35685)
  • state recovery / recover_after* settings (#36013)
  • support for (rolling) upgrades from 6.x (#35443, #35737)
  • best-effort auto-bootstrapping on unconfigured discovery to provide good OOTB experience (#36215)
  • correctly respect the no_master_block setting (#36478)
  • Introduce deterministic task queue (#32197)
  • Randomized testing of CoordinationState (#32242)
  • Fix JoinTaskExecutor identity issue (#32911)
  • Fix node logging issue (#33929)
  • Output voting tombstones in XContent representation of cluster state (#35832)
  • Add zen2 discovery type (#36298)
  • integration with master-ineligible nodes (#35996, #36247)
  • Full cluster restart upgrade: initial election does not use the proper cluster state version (use metadata version instead) (#37122)
  • Add restarts to CoordinatorTests (#37296)
  • node join validation (#37203)
  • Migrate Zen2 unit tests from InMemoryPersistedState to GatewayMetaState (#36897)
  • Relax bootstrapping to work on discovery of a quorum of the nodes, rather than on all of them. Use a placeholder ID for the unknown nodes. (#37463)
  • unsafe recover API / command line tool: To be used when a quorum of master-eligible nodes has been permanently lost (#37696, #37979, #37812)
  • handling dangling indices and nodes that were previously part of another cluster (#37775)
  • Only allow nodes with an active vote to become master (#37712, #37802)
  • Additional bootstrapping methods? Check whether we have a good story for all typical deployment systems (docker, kubernetes, ...)
  • Security model (voting exclusions API associated with manage cluster privilege)
  • Remove the need for minimum_master_nodes in a rolling upgrade, instead using the minimum_master_nodes from the previous master for bootstrapping. (#37701)
  • Bubble exceptions up in ClusterApplierService (#37729)
  • Prioritize publishing cluster state to master-eligible nodes (#37673)
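The core algorithm, term bumping, and election items above all revolve around the same two rules: a node grants at most one vote per term (and only for a term newer than any it has seen), and a candidate wins an election only with votes from a quorum of the current voting configuration. A minimal sketch of those rules, with invented names rather than the actual `CoordinationState` API:

```python
class Node:
    """A master-eligible node's view of the current term (illustrative only)."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.current_term = 0

    def handle_vote_request(self, candidate_term: int) -> bool:
        """Grant a vote only for a term newer than any seen so far;
        bumping current_term ensures at most one vote per term."""
        if candidate_term > self.current_term:
            self.current_term = candidate_term
            return True
        return False


def election_won(votes: set, voting_configuration: set) -> bool:
    """A candidate needs a quorum (strict majority) of the voting
    configuration; votes from nodes outside it do not count."""
    return 2 * len(votes & voting_configuration) > len(voting_configuration)
```

This is also why the auto-reconfiguration rules matter: by shifting the voting configuration toward available nodes, the system keeps a reachable quorum even as nodes come and go.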

After 7.0 FF:

  • Deprecate any Zen1-specific settings and rename any others that mention zen but which are still in use. (#38289, #38333, #38350)
  • Make discovery.type non-configurable/internal-only / move Zen1 to tests only (#39466)
  • Scaling tests (e.g. election clashes when having large cluster states)
  • Do not close bad indices on state recovery (#39500)
  • Add stats (e.g. expose stuff like node term, or discovery information while the node has troubles forming / joining a cluster) (#35993)
  • Contemplate timeouts, retries, etc. and consider improvements to default values (#38298)
  • Check logged messages are useful and at the appropriate levels (#39756, #39950).
  • Docs 📜 (#34714, #36959, #36954, #36942, #36909), plus docs for full-cluster and rolling upgrades

Post 7.0:

  • Smoother master failovers by not exposing them to the ClusterApplierService, i.e., delay putting up a NO_MASTER_BLOCK.
  • Abdicate on leader shutdown (appoint new leader)
  • Add "has_voting_exclusions" flag to cluster health output (#38568)
  • Make enqueueing of cluster state updates behave as well as possible in an overloaded cluster.
  • Verify that a master which cannot write its cluster state stands down (or maybe actively abdicates)
  • Deal appropriately with duplicate nodes (see e.g. #32904)
  • High-level rest client integration for new APIs
  • Avoid bootstrapping if any discovered peer has a nonzero term
  • Work with support to enhance cluster diagnostics analysis tool.

Pinging @elastic/es-distributed

Closing this one as shipped in 7.0. Possible follow-ups will be tracked separately.