[Proposal] Implement virtually synchronous activation directory hand-off
ReubenBond opened this issue
Background
The default grain directory in Orleans is in-memory, unreplicated, and arranged as a range-partitioned ring. When silos join, they take ownership of a section of the ring. When silos leave, ownership over their section of the ring is immediately transferred to a successor.
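For readers unfamiliar with the layout, here is a minimal, hypothetical sketch of how a range-partitioned ring can map a grain's hash to its directory owner. The DirectoryRing type and its members are illustrative only, not Orleans' actual implementation.

```csharp
// Hypothetical sketch of range-partitioned ring ownership; not Orleans' actual directory code.
using System.Collections.Generic;
using System.Linq;

public sealed class DirectoryRing
{
    // Silos sorted by their position (hash) on the ring; each silo owns the range
    // between its predecessor's position (exclusive) and its own position (inclusive).
    private readonly List<(uint Position, string Silo)> _ring;

    public DirectoryRing(IEnumerable<(uint Position, string Silo)> silos)
        => _ring = silos.OrderBy(s => s.Position).ToList();

    // The directory owner of a grain is the first silo at or after the grain's hash,
    // wrapping around to the first silo on the ring.
    public string GetDirectoryOwner(uint grainHash)
    {
        foreach (var (position, silo) in _ring)
        {
            if (grainHash <= position)
            {
                return silo;
            }
        }

        return _ring[0].Silo; // wrap around
    }
}
```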
While the directory has worked well for most users for many years, it has some long-standing shortcomings. For example:
- When a silo is added, there is a period during which an activation race can occur for a subset of already-active grains while the new silo transfers its directory range from the previous owner.
- When a silo is removed, all grains registered in that silo's directory range must be deactivated, in addition to the grains activated on the silo itself. As a result, in a 3-node cluster, killing one node destroys approximately 2/3 of all grain activations:
- Around 1/3 for the grains that were activated on that node
- Around 1/3 for the grains whose directory registrations were owned by that node
Both issues cause disruption in the form of activation churn and degrade the user experience.
Proposal
This issue is a proposal to alleviate both issues by adopting the Virtual Synchrony methodology more fully than we do today. Essentially, this involves a lock-step hand-off process: each successive 'view' (directory membership configuration) does not begin servicing requests until it has transferred state from its predecessor, and predecessors halt operation once they are contacted by their successor. If continuity between views is ever broken, the directory membership version at which continuity was broken is recorded. Clients learn of this break in continuity by querying each successive view and take appropriate action, either by attempting to salvage affected activations or by destroying them.
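To make the hand-off flow concrete, here is a minimal sketch of the lock-step transfer. The types and members below (IDirectoryView, ViewSnapshot, DirectoryViewActivation) are hypothetical, not Orleans APIs: a new view wedges its predecessor, pulls its state and data-loss version, and only then begins servicing requests.

```csharp
// Illustrative sketch of the lock-step hand-off described above; hypothetical types only.
using System.Threading.Tasks;

public sealed record ViewSnapshot(long MembershipVersion, object DirectoryState, long DataLossVersion);

public interface IDirectoryView
{
    // Called by the successor view: stop servicing requests ('wedge') and hand over state.
    Task<ViewSnapshot> WedgeAndTransferAsync();
}

public sealed class DirectoryViewActivation
{
    public long MembershipVersion { get; }

    // The membership version at which continuity was last broken.
    public long DataLossVersion { get; private set; }

    // A view only services requests once it has completed the hand-off from its predecessor.
    public bool IsActive { get; private set; }

    public DirectoryViewActivation(long membershipVersion) => MembershipVersion = membershipVersion;

    public async Task StartAsync(IDirectoryView? predecessor)
    {
        if (predecessor is null)
        {
            // No reachable predecessor: continuity is broken at this membership version.
            DataLossVersion = MembershipVersion;
        }
        else
        {
            var snapshot = await predecessor.WedgeAndTransferAsync();

            // Inherit the predecessor's record of the last break in continuity,
            // and install its directory state locally (omitted here).
            DataLossVersion = snapshot.DataLossVersion;
        }

        IsActive = true; // only now begin servicing directory requests
    }
}
```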
We have the prerequisites in place to achieve this:
- Cluster membership is strongly consistent, durable, and monotonically versioned.
- Clustering uses a crash-stop model: when silos re-join after a fault, they receive a new identifier and are treated as entirely new members.
- The grain directory inherits its versioning from cluster membership.
- The in-memory directory has a GrainDirectoryHandoffManager (GDHM) component which is responsible for transferring directory sub-ranges between silos on membership change.
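For concreteness, the sketch below illustrates the first two points with hypothetical types (not the actual Orleans membership types): versions only ever increase, and a restarted silo carries a new generation and is therefore a distinct member.

```csharp
// Illustrative only; these are not the actual Orleans membership types.
using System;

// Cluster membership versions are monotonically increasing, and the directory
// inherits these versions directly.
public readonly record struct MembershipVersion(long Value) : IComparable<MembershipVersion>
{
    public int CompareTo(MembershipVersion other) => Value.CompareTo(other.Value);
}

// Crash-stop identity: a silo that restarts after a fault gets a new Generation,
// so it is treated as an entirely new member.
public readonly record struct SiloIdentity(string Endpoint, int Generation);
```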
What we need to change is the following:
- When a directory partition expands/contracts, it must keep the added/removed sub-range 'wedged' until hand-off completes (respectively).
- GrainDirectoryHandoffManager should expose a facility to check whether a grain resides in a 'wedged' range.
- GDHM must track the most recent Data Loss Version (DLV): the most recent membership version at which it determined that data loss may have occurred (see the sketch after this list).
- During hand-off, GDHM also passes the DLV to the new owner.
- When receiving a new partition (expanding the owned range by merging the two), the DLV is set to the greater (more recent) of the current partition's DLV and the received partition's DLV.
- When a silo learns of a cluster membership change, it asynchronously queries all new directory partitions to learn their data-loss versions. For each activation in the catalog, if the new owner's DLV is greater than the activation's GrainAddress.MembershipVersion value, that grain needs to be recovered:
- either by re-registering the grain activation and deactivating it if registration fails,
- or by deactivating the grain activation immediately.
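To tie these changes together, the sketch below shows how the wedged-range check, the DLV merge during hand-off, and the catalog-side recovery check could fit together. All type and member names here are hypothetical; this is not the real GrainDirectoryHandoffManager API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; not the real GrainDirectoryHandoffManager.
public sealed class HandoffManagerSketch
{
    // Sub-ranges that are currently mid-transfer and must not serve lookups or registrations.
    private readonly HashSet<(uint Start, uint End)> _wedgedRanges = new();

    // Data Loss Version: the most recent membership version at which data loss may have occurred.
    public long DataLossVersion { get; private set; }

    // Facility to check whether a grain's hash falls within a wedged range.
    public bool IsWedged(uint grainHash)
    {
        foreach (var (start, end) in _wedgedRanges)
        {
            if (grainHash >= start && grainHash <= end)
            {
                return true;
            }
        }

        return false;
    }

    // When receiving a partition from the previous owner, merge it into the local partition
    // and take the most recent of the two data-loss versions.
    public void AcceptPartition((uint Start, uint End) range, long receivedDataLossVersion)
    {
        DataLossVersion = Math.Max(DataLossVersion, receivedDataLossVersion);
        _wedgedRanges.Remove(range); // the range is now fully owned; unwedge it
        // ... merging of the received directory entries is omitted ...
    }
}

// Catalog-side recovery check: after a membership change, compare each activation's
// registration version against the new directory owner's DLV.
public static class ActivationRecovery
{
    public static bool NeedsRecovery(long ownerDataLossVersion, long activationMembershipVersion)
        => ownerDataLossVersion > activationMembershipVersion;
}
```

If NeedsRecovery returns true, the silo would then either re-register the activation (deactivating it if registration fails) or deactivate it immediately, as listed above.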
Future work
After implementing this proposal, there is an extension which I would like us to consider. Directory membership version is currently equivalent to cluster membership version. This impacts us in two ways:
- There is no way to coordinate directory startup/shutdown separately from silo startup/shutdown.
- Silos cannot opt out of participating in the directory.
Therefore, an extension to this proposal is to insert a level between Cluster & Directory membership which allows for dynamic group membership. My initial thinking is to use a strategy very similar to the one described here (i.e., the virtual synchrony approach), but with group membership replicated via consensus using a CASPaxos implementation.
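As a very rough sketch of what that intermediate level could look like, assuming a hypothetical replicated-register abstraction (none of these types exist in Orleans today):

```csharp
using System.Collections.Immutable;
using System.Threading.Tasks;

// Hypothetical: a directory membership view versioned independently of cluster membership.
public sealed record DirectoryMembershipView(long Version, ImmutableHashSet<string> Members);

// Hypothetical replicated register, e.g. backed by a CASPaxos-style consensus implementation.
public interface IReplicatedRegister<T>
{
    Task<T> ReadAsync();

    // Returns the new value if the compare-and-swap against 'expectedVersion' succeeded,
    // otherwise the current value.
    Task<T> CompareAndSwapAsync(long expectedVersion, T newValue);
}

public static class DirectoryMembership
{
    // A silo opts in to the directory by CAS-ing itself into the current view.
    public static async Task<DirectoryMembershipView> JoinAsync(
        IReplicatedRegister<DirectoryMembershipView> register, string silo)
    {
        var view = await register.ReadAsync();
        var proposed = new DirectoryMembershipView(view.Version + 1, view.Members.Add(silo));
        return await register.CompareAndSwapAsync(view.Version, proposed);
    }
}
```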
The next phase, after this, would be to implement support for non-directory services (the default stream partition balancer, reminders). The result should be the ability to separate application code processes from system code processes as desired.