dotnet/orleans

[Idea] Salvage Activations from Defunct Directory Partitions

ReubenBond opened this issue · 10 comments

This is not my idea, but I'm documenting my understanding of it here for discussion so that we can come up with an action plan.

First, a primer on how Orleans finds where a grain activation lives - correct me if I'm wrong:

  • Orleans supports a flexible placement model where the placement of an activation is dynamic and guided by its placement policy rather than being restricted to a fixed calculation.
  • To support this model, Orleans maintains a distributed lookup table from GrainId to activation address (the ActivationId together with the SiloAddress of the silo which hosts that activation).
  • This distributed table is partitioned across all silos in the cluster, where each silo holds one partition of the directory.
  • Silos are placed around a conceptual ring based upon a hash of their SiloAddress (similar to nodes in a Chord DHT).
  • To find where a grain activation exists in the cluster, its GrainId is mapped to a silo on that ring by finding the silo with the largest hash(SiloAddress) that is less than hash(GrainId). This is the Primary Directory Partition for that grain (see the sketch after this list).
  • When a silo is added to the cluster, the directory on each silo is notified and they each perform a hand-off for any grains which have a new primary directory partition (based on the above algorithm).
  • When a silo is removed, similar rebalancing happens so that new activations can be placed on a surviving silo.
  • Each silo maintains a local cache of parts of the distributed table as an optimization, similar to a DNS cache.
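
To make the ring lookup concrete, here is a minimal sketch of the mapping described above. It is illustrative only - the names, the string silo addresses, and the uint hashes are assumptions, not the actual LocalGrainDirectory code:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DirectoryRing
{
    // Pick the silo with the largest hash(SiloAddress) that is less than the
    // grain's hash, wrapping around the ring if no such silo exists.
    // Assumes at least one active silo.
    public static string FindPrimaryDirectoryPartition(
        IReadOnlyDictionary<string, uint> siloHashes, uint grainHash)
    {
        var ordered = siloHashes.OrderBy(kv => kv.Value).ToList();

        // Walk the ring backwards from the grain's position.
        for (var i = ordered.Count - 1; i >= 0; i--)
        {
            if (ordered[i].Value < grainHash) return ordered[i].Key;
        }

        // Wrap around: the grain hashes below every silo, so the
        // highest-hashed silo owns this segment.
        return ordered[ordered.Count - 1].Key;
    }
}
```

When a silo joins or leaves, only the ring segment adjacent to its hash changes owner, which is why each directory hands off only a portion of its entries.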

That's how a directory partition handles cluster membership changes, but what happens to the actual activations (Grain instances) during a membership change? Currently, if a silo dies, every activation whose primary directory partition was on that silo is eagerly deactivated (see Catalog.SiloStatusChangeNotification). That is, if SiloA has an activation whose primary directory partition is on SiloB and SiloB dies, then SiloA will kill that activation. This can cause a large amount of activation churn, particularly in small clusters.
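
To illustrate the current behavior in code (a simplified, hypothetical sketch - the real logic lives in Catalog.SiloStatusChangeNotification and is more involved, and the types below are made up):

```csharp
using System.Collections.Generic;

// One local activation and the silo which holds its directory entry.
public sealed record ActivationInfo(string GrainId, string DirectoryPartitionSilo);

public static class CurrentBehavior
{
    // When 'deadSilo' is declared dead, every local activation whose directory
    // entry lived on that silo is eagerly deactivated, even though the
    // activation itself is healthy.
    public static IEnumerable<ActivationInfo> ActivationsToDeactivate(
        IEnumerable<ActivationInfo> localActivations, string deadSilo)
    {
        foreach (var a in localActivations)
        {
            if (a.DirectoryPartitionSilo == deadSilo)
                yield return a;
        }
    }
}
```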

The proposal is to register those activations with the correct, surviving directory partition instead of deactivating them.

We must maintain a few invariants while implementing this optimization:

  • Activations must eventually converge to at most one per grain.
  • Activations which are tracked using the directory cannot be allowed to exist without being registered in the directory - they cannot be orphaned.
  • Activations must eventually be registered in the correct directory partition.

Are there nuances here which I've missed or is this too vague?

I just want to add that, from what I understood, this only happens if the silo dies. If the silo is stopped properly, it will hand off its grain directory to other silos.

@benjaminpetit, yep, if the silo is gracefully shut down, then directory handoff will occur before the silo terminates, so it will have zero activations registered to it when it does terminate. This is all for the ungraceful termination case.

So, in more concrete terms of what this entails:

  • When the membership system notifies LocalGrainDirectory that a silo has died, LocalGrainDirectory will ask Catalog which of its activations are registered on that dead silo.
  • LocalGrainDirectory will then adjust to the new configuration and register those activations with the new owner for that section of the ring.

Does that sound like a fine strategy?
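
In code, the intent is roughly the following. This is only a sketch: ICatalogView, IDirectoryWriter, and the other names are illustrative stand-ins, not the real Catalog/LocalGrainDirectory APIs, and registration failures are deliberately ignored here (they are discussed below).

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public interface ICatalogView
{
    // Local activations whose directory entry lived on the given (now dead) silo.
    IReadOnlyList<string> GetGrainsRegisteredOn(string deadSilo);
}

public interface IDirectoryWriter
{
    // Registers (grainId -> hostingSilo) in the directory partition owned by ownerSilo.
    Task RegisterAsync(string ownerSilo, string grainId, string hostingSilo);
}

public sealed class SalvageOnSiloDeath
{
    private readonly ICatalogView _catalog;
    private readonly IDirectoryWriter _directory;
    private readonly Func<string, string> _findNewOwner; // grainId -> owner on the updated ring
    private readonly string _localSilo;

    public SalvageOnSiloDeath(ICatalogView catalog, IDirectoryWriter directory,
        Func<string, string> findNewOwner, string localSilo)
    {
        _catalog = catalog;
        _directory = directory;
        _findNewOwner = findNewOwner;
        _localSilo = localSilo;
    }

    public async Task OnSiloDeadAsync(string deadSilo)
    {
        foreach (var grainId in _catalog.GetGrainsRegisteredOn(deadSilo))
        {
            // Re-register with the new owner of this grain's ring segment
            // instead of deactivating the healthy local activation.
            await _directory.RegisterAsync(_findNewOwner(grainId), grainId, _localSilo);
        }
    }
}
```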

LocalGrainDirectory will then adjust to the new configuration and register those activations with the new owner for that section of the ring.

It will try to register, and that registration may fail. We need to think through the failure cases carefully. The new owner may not know yet that it is the new owner of the grain ID range.

@sergeybykov thanks :) At what point can we be sure that a remote silo knows it's the owner of a section of the ring?

I'm not sure we have a 100% guaranteed point like that. The membership table is the source of truth, and all silos read it to learn about changes. It is theoretically possible for a single silo to keep failing to read the table for a long time or even forever. In practice, of course, this doesn't happen unless there is a specific network connectivity issue or a bug preventing the table call from succeeding.

@ReubenBond, in general this is a very good idea. We thought about that years ago. We called it "client side replication", as opposed to server side replication, where silos replicate the directory (synchronously or asynchronously) between themselves. The main reason why we never implemented client side replication is that for a long time there was wishful thinking, "a promise", of leveraging something else for server side replication (you can imagine what this something else was).
In the current situation there is no reason not to do client side replication.

Regarding how exactly: there are of course a lot of small details to work through. You will need to decide how available vs. correct (avoiding double activation) you want to be. I would start by asking: let's look at the simplest naive protocol and see what could go wrong. Based on that you will understand what needs to be improved and what the corner cases to watch are. The simplest naive protocol could be: when activation A learns that its primary directory partition was lost (it became an orphan), it just tries to register again in the directory, without doing anything else special. Even before taking failures of registration into account, you need to reason very carefully about whether that would always be correct, and about what needs to be done to make it correct.

BTW, re your question in #2656 (comment): as Sergey wrote, we never had that explicit point in time. BUT, there is something we could do differently to help with that. The current membership and directory do not use the fact that the "Extended membership protocol" provides a total order of membership views. It also provides a version number for each view. We could use that info to order directory mappings. Every membership list will have a monotonically increasing version number. When a grain registers, it gets back a registration carrying this sequence number. That way, if we have 2 registered activations, we can break the tie by picking the bigger version.
The main downside is that we would be taking a strong dependency on the extended membership protocol, whereas so far we have not really required it. Also, it will make porting MBROracle to other Oracles (like SF) harder, since those don't provide a total order.
Just an idea.
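
A minimal sketch of that tie-breaking idea, with illustrative types (the real directory entry and the membership-version plumbing would of course look different):

```csharp
// A directory entry tagged with the membership view version under which it was
// registered; ties between two registrations are broken by the higher version.
public sealed record DirectoryEntry(string GrainId, string HostingSilo, long MembershipVersion);

public static class DirectoryTieBreaker
{
    // Keep the registration made under the later membership view; the loser is
    // the duplicate activation that should be deactivated.
    public static (DirectoryEntry Winner, DirectoryEntry Loser) Resolve(DirectoryEntry a, DirectoryEntry b)
        => a.MembershipVersion >= b.MembershipVersion ? (a, b) : (b, a);
}
```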

@benjaminpetit and I looked at leveraging cluster state (membership) version in #2443, but ended up succeeding without it. It's an interesting idea to consider adding membership versions to directory calls.

I had a chat with Sergey about the strategy for this and we identified two basic approaches: push vs pull.

Basic Pull Model

When a silo learns that another silo has been declared dead, it does not terminate any activations which were registered on that silo. Instead, it allows these activations to continue processing normally.

When a directory learns that it has responsibility over a greater portion of the ring, it starts a salvage process as follows:

  • The directory sends all silos a request asking them to re-register activations for its portion of the ring, specifying some token which it generates at the beginning of the process.
  • Those silos then make a request to that directory to re-register activations, specifying the provided token. This is retried until a response has been received or the directory silo has been declared dead.
  • The directory responds to those requests with a success/failure notification for each activation. Those silos destroy any activations which failed to be re-registered.
  • This process repeats until the directory has received re-registration requests from all silos which were active at the starting point and have not since been declared dead.

In this model, too, there is a window for duplicate activations. It opens when any silo learns that a silo has been declared dead and closes when the new owner of that directory partition has received re-registration requests from all silos.
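
A rough sketch of the pull model from the new partition owner's point of view - all the names and message shapes below are assumptions, not a real Orleans API:

```csharp
using System;
using System.Collections.Generic;

public sealed class PullSalvageCoordinator
{
    private readonly Guid _salvageToken = Guid.NewGuid();   // token generated at the start of the process
    private readonly HashSet<string> _pendingSilos;         // silos active at the starting point
    private readonly Dictionary<string, string> _partition = new Dictionary<string, string>(); // grainId -> hosting silo

    public PullSalvageCoordinator(IEnumerable<string> activeSilos)
        => _pendingSilos = new HashSet<string>(activeSilos);

    public Guid SalvageToken => _salvageToken;

    // Called when a silo sends its re-registration batch for our portion of the
    // ring. Returns, per grain, whether the registration succeeded; the caller
    // destroys activations that failed (another activation already owns the entry).
    public IReadOnlyDictionary<string, bool> OnReRegister(
        string fromSilo, Guid token, IEnumerable<string> grainIds)
    {
        if (token != _salvageToken) throw new InvalidOperationException("Stale salvage token.");

        var results = new Dictionary<string, bool>();
        foreach (var grainId in grainIds)
        {
            // First registration wins; a later one for the same grain is rejected.
            results[grainId] = _partition.TryAdd(grainId, fromSilo);
        }

        _pendingSilos.Remove(fromSilo);
        return results;
    }

    // Called when membership declares one of the pending silos dead.
    public void OnSiloDead(string silo) => _pendingSilos.Remove(silo);

    // The salvage (and the duplicate-activation window) ends once every silo
    // that was active at the start has responded or been declared dead.
    public bool IsComplete => _pendingSilos.Count == 0;
}
```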

Basic Push Model

When membership detects that a silo has died, it informs the catalog, which determines which activations have lost their registrations and initiates the salvage process to repair the directory.

  • The salvage process repeatedly attempts to re-register each activation with its new target directory:
    • If a target directory does not believe it's the owner of an activation's GrainId, then it rejects with an IncorrectPartition status. The salvage process will retry after a small delay.
    • If an activation has already been registered in the target directory, the target directory rejects the request with a DuplicateActivation status. The salvage process will terminate the activation and remove it from the salvage list.
    • If successful, the activation is removed from the salvage list.
  • This continues until there are no more activations to salvage.

This does not introduce any additional error cases during normal/graceful operation. If a silo is terminated abruptly (e.g., network partition, power outage), then there is an increased window for duplicate activations while the salvage operation is running. Once all salvage operations have terminated, there will be no duplicate activations.
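
A sketch of the push-model salvage loop. The status names follow the list above, but the interfaces and the retry delay are illustrative assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public enum RegistrationStatus { Success, IncorrectPartition, DuplicateActivation }

public interface IDirectoryRegistrar
{
    Task<RegistrationStatus> TryRegisterAsync(string ownerSilo, string grainId, string hostingSilo);
}

public sealed class PushSalvageProcess
{
    private readonly IDirectoryRegistrar _directory;
    private readonly Func<string, string> _findOwner; // grainId -> current owner on the ring
    private readonly Action<string> _deactivate;      // terminates a local activation
    private readonly string _localSilo;

    public PushSalvageProcess(IDirectoryRegistrar directory, Func<string, string> findOwner,
        Action<string> deactivate, string localSilo)
    {
        _directory = directory;
        _findOwner = findOwner;
        _deactivate = deactivate;
        _localSilo = localSilo;
    }

    public async Task SalvageAsync(List<string> toSalvage)
    {
        // Repeat until there are no more activations to salvage.
        while (toSalvage.Count > 0)
        {
            for (var i = toSalvage.Count - 1; i >= 0; i--)
            {
                var grainId = toSalvage[i];
                var status = await _directory.TryRegisterAsync(_findOwner(grainId), grainId, _localSilo);
                switch (status)
                {
                    case RegistrationStatus.Success:
                        toSalvage.RemoveAt(i);            // salvaged
                        break;
                    case RegistrationStatus.DuplicateActivation:
                        _deactivate(grainId);             // another activation won the entry
                        toSalvage.RemoveAt(i);
                        break;
                    case RegistrationStatus.IncorrectPartition:
                        break;                            // retry after the delay below
                }
            }

            if (toSalvage.Count > 0)
                await Task.Delay(TimeSpan.FromSeconds(1)); // small delay before retrying
        }
    }
}
```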

Lease-based Adaptations

We can adapt each of these models to require that directories obtain a lease for the range of the ring which they cover. With this adaptation, we can close the window for duplicate activations at the cost of continuous availability. This is mostly useful for supporting Strong Single Activation (#2428), but also helps to tighten the window in general.

In the pull model, directories do not accept new activations until they have completed the salvage process. This reduces the chance of duplicate activations; however, there is still a window for duplicate activation in the case of a network partition. For example:

  • Silo A hosts an activation for Grain X, registered on Silo B
  • Silo A becomes partitioned
  • Silo B dies and is succeeded by Silo C
  • Grain X is activated on Silo C, while still being active on the (partitioned) Silo A.

In the push model, Strong Single Activation can be achieved by ensuring that every silo continually learns about the lease status of each directory partition. When a partition's lease is nearing expiry or has expired, all [StrongSingleActivation] activations in that partition are either destroyed or paused (cannot process turns) until the activation can be re-registered with a successor (the new lease holder for that partition). With a large number of silos/partitions, keeping silos informed of lease status in a timely fashion could be expensive (O(N^2)) if each silo messages each directory directly. To reduce that burden, a 'lease cache' SystemTarget can be introduced on (for example) the silo with the lowest address. It can perform that maintenance and reduce the messaging complexity to O(N). There is no need for a single cache - duplicates are fine.
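
As a minimal illustration of the lease check, assuming a per-partition lease record (none of this exists in Orleans today):

```csharp
using System;

// Illustrative only: the lease a directory partition's owner holds for its ring segment.
public sealed record PartitionLease(string OwnerSilo, DateTimeOffset ExpiresAt);

public static class StrongSingleActivationGuard
{
    // Safety margin so we stop processing turns before the lease actually lapses.
    private static readonly TimeSpan Margin = TimeSpan.FromSeconds(5);

    // A [StrongSingleActivation] activation may only process turns while the
    // directory partition that registered it holds an unexpired lease (with margin).
    public static bool MayProcessTurns(PartitionLease lease, DateTimeOffset now)
        => now < lease.ExpiresAt - Margin;
}
```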

Strong Single Activation can be achieved in other ways (and it's separate from this issue). Probably we can adapt the pull model to work with it also, but it's less obvious to me.