wasmCloud/wascc-host

Rationalize the nature of bindings in the context of the lattice

autodidaddict opened this issue · 6 comments

The original concept of bindings was created for an isolated, single-process system that had no knowledge of a lattice. Right now, the way we're handling bindings isn't quite in accordance with the kind of functionality we want to support in the lattice. There are a number of things that we need to do in order to bring bindings up to par with lattice functionality:

  • Rename the host's bind_actor method to set_binding (this will overwrite any previously existing binding for the triplet of actor-provider-provider_instance_name (e.g. "default"))
  • Add a remove_binding function (there isn't one today, only remove_actor)
  • (Implementation Detail) - All bindings must be shared knowledge among every member of the lattice. This means a couple of things:
    • Each host will queue subscribe to the inventory query to obtain bindings, since that will no longer be a scatter-gather operation. Each host can answer with the complete list of bindings
    • When a binding is created from any host, regardless of where the actor and capability are located, the bus on each host holding the actor in question will subscribe to the private subject for that actor<->capability binding.
    • When a binding is removed from any host, regardless of where the actor and capability are located, the bus on each host holding the actor in question will unsubscribe from the private subject for that actor<->capability binding.
  • When an actor is removed, the bus will unsubscribe from all pertinent actor subjects (including all of its bound providers), but the binding information will not be removed
  • When a capability provider is removed, all bound actors will unsubscribe from all relevant subjects, but the binding data will remain in the lattice.
  • When an actor is added to a host, the bus will be checked for relevant bindings, and the actor will subscribe to all private comms subjects for each discovered binding involving that actor.
  • When a capability provider is added to a host, it will receive OP_BIND_ACTOR messages for all of its existing bindings, allowing it to provision all appropriate resources and establish communications on the right bus subject. This will close issue #46
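
To make the proposed API surface concrete, here is a rough sketch of what the renamed and added host methods could look like. Everything here is illustrative only: Host, HostError, and BindingValues are placeholder types for the sketch, not the actual wascc-host signatures.

```rust
use std::collections::HashMap;

// Placeholder types for the sketch; the real host has its own host,
// error, and configuration types.
pub struct Host;
#[derive(Debug)]
pub struct HostError(pub String);
pub type BindingValues = HashMap<String, String>;

impl Host {
    /// Replaces bind_actor: overwrites any existing binding for the
    /// (actor, capability, provider instance name) triplet, e.g. "default".
    pub fn set_binding(
        &self,
        actor: &str,
        capid: &str,
        instance_name: Option<&str>,
        values: BindingValues,
    ) -> Result<(), HostError> {
        let instance = instance_name.unwrap_or("default");
        // 1. If the binding differs from what the bus already knows about,
        //    publish a BindingSet event (triggers cache updates in all hosts).
        // 2. Deliver OP_BIND_ACTOR to the named provider instance.
        let _ = (actor, capid, instance, values);
        Ok(())
    }

    /// New function: removes a binding (today only remove_actor exists).
    pub fn remove_binding(
        &self,
        actor: &str,
        capid: &str,
        instance_name: Option<&str>,
    ) -> Result<(), HostError> {
        let instance = instance_name.unwrap_or("default");
        // 1. Deliver OP_REMOVE_ACTOR to the named provider instance.
        // 2. Publish a BindingRemoved event on the bus.
        let _ = (actor, capid, instance);
        Ok(())
    }
}
```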

Acceptance Criteria

To verify that this is working as planned:

  • All previous non-lattice tests must continue to work. The host needs to work as it always has when lattice mode isn't enabled.
  • With lattice enabled, we should be able to perform the following test:
    • Create 2 hosts with no actors and no capability providers
    • Add an actor to each host
    • Add a provider to each host
    • Call set_binding from either host to establish actor-provider bindings (2 bindings)
    • Remove actor from host 1 (binding should remain in the lattice)
    • Add actor to host 2 (binding should not duplicate, should not be re-initialized by the provider, the actor should subscribe to the right topic, and we can make a call on the provider that will talk to the actor)
    • Remove actor 2 from host 2
    • Add actor 2 to host 1
    • Verify actor 2 continues to function normally
    • Remove provider 1 from host 1
    • Add provider 1 to host 2
    • Verify that all of provider 1's bound actors continue to work and that it responds to all of its appropriate stimuli
    • Call remove_binding for actor 1-provider 1. Assuming this provider is an HTTP server, the actor should still be in the lattice after this binding removal, but attempting to curl the previously bound port should fail with no response, because removing the binding should have shut down the relevant resources.

Thoughts on the new flow:

  1. During the set_binding call
    1. If the binding differs from what the bus already sees as existing, the host will tell the bus to publish a BindingSet event (which will trigger cache updates in all listening hosts)
    2. If the binding differs from what the bus already sees as the existing binding, the bus will deliver the OP_BIND_ACTOR message to the named provider instance
  2. During the remove_binding call
    1. Send the OP_REMOVE_ACTOR message to the named provider instance
    2. Publish a BindingRemoved event on the bus (which will trigger cache updates in all listening hosts)
  3. When an actor is removed from a host
    1. Unsubscribe all of that actor's active bus subscriptions, remove local claims, etc.
  4. When a capability provider is removed from a host
    1. Unsubscribe all of that provider's active subscriptions
  5. When an actor is added to a host
    1. Subscribe to all actor-provider private topics based on the bus's awareness of existing bindings
  6. When a capability provider is added to a host
    1. Receive an OP_BIND_ACTOR message (locally!) for each of the bindings known for that provider. (Thoughts: I worry that bypassing the normal bus mechanisms to attempt local-only delivery of messages could ultimately be a source of skew over time and/or split-brain syndrome. It might be a bigger burden on providers, but it might also be far easier to simply require providers to enforce idempotency on all binding calls - e.g. politely do nothing for redundant binds. This would make 3 instances of the same provider ignore the duplicate message, while the 4th, newly reconstituted provider sees it as a new binding and provisions the appropriate resources.)
  7. The bus will always be listening for BindingSet and BindingRemoved events, which it will use to maintain a local cache of known bindings. That cache is what will be used to respond to inventory queries made by lattice clients or by lattice members attempting to restore binding data to restarting actors or providers.
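
A minimal sketch of the cache maintenance described in step 7 might look something like this (BusEvent, BindingCache, and the key layout are all invented for illustration; the real bus would carry serialized messages on its own subjects):

```rust
use std::collections::HashMap;

/// Key for a binding: (actor public key, capability id, instance name).
type BindingKey = (String, String, String);

/// Illustrative lattice events; the real bus would carry serialized messages
/// on well-known subjects.
enum BusEvent {
    BindingSet {
        actor: String,
        capid: String,
        instance: String,
        values: HashMap<String, String>,
    },
    BindingRemoved {
        actor: String,
        capid: String,
        instance: String,
    },
}

/// Local cache of known bindings, updated from bus events and used to answer
/// inventory queries or to re-subscribe restarting actors/providers.
#[derive(Default)]
struct BindingCache {
    bindings: HashMap<BindingKey, HashMap<String, String>>,
}

impl BindingCache {
    fn apply(&mut self, event: BusEvent) {
        match event {
            BusEvent::BindingSet { actor, capid, instance, values } => {
                // Overwrites any previous binding for the same triplet.
                self.bindings.insert((actor, capid, instance), values);
            }
            BusEvent::BindingRemoved { actor, capid, instance } => {
                self.bindings.remove(&(actor, capid, instance));
            }
        }
    }

    /// All bindings involving a given actor, e.g. used when that actor is
    /// added to a host and must subscribe to its private comms subjects.
    fn bindings_for_actor<'a>(&'a self, actor: &'a str) -> impl Iterator<Item = &'a BindingKey> + 'a {
        self.bindings.keys().filter(move |(a, _, _)| a == actor)
    }
}
```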

This can obviously create scenarios where actors will get timeout failures when attempting to communicate with non-existent providers even though their bindings exist (the old system would've removed the binding, so the RPC call would fail immediately due to a lookup failure). I think I'm okay with that, as some other entity should be able to monitor the system and attempt to ensure that there are always enough provider instances for actors, etc.

@ewbankkit @rylev @bacongobbler @brooksmtownsend any thoughts on this? I'm looking for edge cases where this new world where the lattice maintains a distributed cache of known bindings falls down.

🤔 I might be overthinking this entire thing. If providers can be trusted to safely ignore "re-bind" operations, then we might be able to boil this down to:

  1. When an actor starts, publish OP_BIND_ACTOR for all its existing bindings
  2. When a capability provider starts, publish OP_BIND_ACTOR for all its existing bindings
  3. When a binding is set, publish the BindingSet event and OP_BIND_ACTOR for the new binding
  4. When a binding is removed, publish BindingRemoved event and OP_REMOVE_ACTOR to all matching providers
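
If we do lean on provider-side idempotency, handling OP_BIND_ACTOR inside a provider could look roughly like this (a sketch only; BindingConfig and the HTTP server provider shape are invented for illustration):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Illustrative binding configuration delivered with OP_BIND_ACTOR.
#[derive(Clone, PartialEq)]
struct BindingConfig {
    actor: String,
    values: HashMap<String, String>,
}

#[derive(Default)]
struct HttpServerProvider {
    // actor public key -> configuration currently provisioned
    active: Mutex<HashMap<String, BindingConfig>>,
}

impl HttpServerProvider {
    /// Idempotent bind: a redundant OP_BIND_ACTOR for an identical,
    /// already-provisioned binding is politely ignored, while a newly
    /// reconstituted provider instance (with no state) provisions resources.
    fn handle_bind_actor(&self, config: BindingConfig) {
        let mut active = self.active.lock().unwrap();
        match active.get(&config.actor) {
            Some(existing) if *existing == config => {
                // Duplicate bind: already provisioned, do nothing.
            }
            _ => {
                // New (or changed) binding: spin up the HTTP listener,
                // port, subscriptions, etc. for this actor.
                active.insert(config.actor.clone(), config);
            }
        }
    }
}
```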

🤔

Further thoughts: what are the tradeoffs between having a "binding service" that each host queries in order to get updated binding data versus having each host maintain a cache? Off the top of my head, the ones that bug me the most:

  1. For a lattice with hundreds of actors and hundreds of capability bindings, that data overhead needs to be maintained by all of the lattice hosts. It's probably not more than 100KB of consumption per host, but it still could be considered wasteful.

  2. For portions of the lattice that exist on the edge or at remote endpoints, like a Raspberry Pi deployed in the field, that device needs to maintain fairly reliable, constant contact with the lattice in order to function properly. If it simply queried the lattice for bindings, and the closest binding service responded to the query, then the host on that device would only need to concern itself with bindings that are immediately relevant to it.

  3. A binding service could be a SPOF (single point of failure), but running on a lattice we could deploy multiple binding services in multiple leaf cells to reduce traffic and increase resiliency

  4. If a host only queries binding data when it is necessary for either an actor or a provider being loaded into the host, then its memory only ever holds the configuration relevant to the providers running in-process. In the "distributed cache" model, all hosts contain all data, so compromising the memory space of a host in that model has a much bigger blast radius.

? 🤔

Even further thoughts. In a scaled situation, we can conceivably have two instances of the same provider running in the lattice. If this is an HTTP provider, and we have 2 different actors, we need to be able to tell both of those instances to spin up the appropriate resources for each of the unique actors. In other words, actors must be able to scale on their own, on demand, and providers must be able to scale on their own, on demand. When a provider scales, it must be able to accept the same binding information that other instances previously accepted.

I think a potential alternative to the various highly complex solutions in previous comments is to go the auction route: to establish a binding between a group of n actor instances and y provider instances, we hold an auction. The instigator (e.g. a lattice client or the host API) publishes the auction request for the binding, and the first host to respond affirmatively will be issued a control command to establish that binding (ultimately resulting in the OP_BIND_ACTOR invocation being performed on the provider residing in that host). After the first auction, the binding will only be partially applied, because not all of the y provider instances are bound. To reach a fully bound state, the binding auction can take place y-1 more times, until all of the available providers have accepted the binding.
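
Very roughly, the auction exchange could look like the following sketch (the message shapes, names, and the fully_bind helper are all invented for illustration and are not an actual lattice protocol definition):

```rust
/// Illustrative messages for a binding auction. The instigator publishes the
/// request; the first host to answer affirmatively receives a control command
/// that results in OP_BIND_ACTOR on its local provider instance.
struct BindingAuctionRequest {
    actor: String,         // actor (group) public key, e.g. "Mxxx"
    capid: String,         // e.g. "wascc:http_server"
    instance_name: String, // e.g. "default"
}

struct BindingAuctionResponse {
    host_id: String,
    accept: bool, // true if this host has an unbound instance of the provider
}

/// To go from a partially applied binding to a fully bound one with y provider
/// instances, the instigator simply repeats the auction until every instance
/// has accepted (y successful auctions in total).
fn fully_bind(y: usize, run_auction: impl Fn(&BindingAuctionRequest) -> Option<BindingAuctionResponse>) {
    let request = BindingAuctionRequest {
        actor: "Mxxx".into(),
        capid: "wascc:http_server".into(),
        instance_name: "default".into(),
    };
    let mut bound = 0;
    while bound < y {
        match run_auction(&request) {
            Some(winner) if winner.accept => {
                // Issue the control command to the winning host, which will
                // perform OP_BIND_ACTOR on its local provider instance.
                bound += 1;
            }
            _ => break, // no more unbound instances responded
        }
    }
}
```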

Some concrete examples:

  • We want 4 instances of actor Mxxx and 2 instances of the HTTP server provider. 2 successful binding auctions would be required for actor Mxxx to be fully bound to the provider. After this, 2 of the wascc-hosts would have a viable HTTP endpoint running on port PORT, delivering requests randomly to each of the 4 actor instances.
  • We want 5 instances of actor Mxxx, and 4 instances of the message broker provider. 4 successful binding auctions would bring this actor to being fully bound to the broker provider. Each of the 4 bound providers would have identical subscriptions and each of them would deliver messages to the group of 5 instances of the actors. This means that ops/developers would need to be careful about choosing regular or queue subscriptions - A queue subscription would have the 4 bound providers split the subscribed message delivery 4 ways.

Some benefits:

  • In an auction model, there is no replicated state, dramatically reducing the development burden and eliminating a huge potential source of bugs and runtime inconsistencies
    • No replicated state also means no sensitive configuration data for off-host providers would be stored
  • "Leaf-local" support - using a leaf node, you can have in-leaf responders reply to auction requests first because they will receive them first. This can be done automatically through the use of leaf nodes, with no work needing to be done by operations or developers
  • Explicit scale control - if you want all running instances (that are responding to auctions) to be bound to the actor group, simply hold n auctions.
  • When providers or actors go away, they simply unsubscribe.

Potential problems:

  • How does a host loading a provider know to re-hold n auctions? If it just issues an inventory query it can then simply hold an auction, but this brings up another question - do we want to support operations where a binding isn't full? In other words, is it ever considered "good" if an actor should be bound to wascc:messaging, there are 3 instances of wascc:messaging, and only one of them has an active binding? I'm leaning toward no - an idle/unbound capability provider is a wasted resource, so either get rid of the provider (which would bring the binding to full) or auction for more bindings.
  • This model doesn't allow for automatically reconstituting a provider's bindings once it has been completely drained from the lattice. This might be acceptable, given that we intend to build higher-level abstractions for #67, so something else will have access to persistent deployment declarations and can hold new auctions.

New proposal providing what could be a more stable, iterative foundation:

| Action | Host Impact | Lattice/Local Bus Impact |
| --- | --- | --- |
| API add_actor | Module loaded, listener thread started | Subscribes to actor topic |
| API remove_actor | Module removed, thread terminated | Unsubscribes from actor topic. If the actor being removed is the last of its instances in the lattice, it will call OP_REMOVE_ACTOR to unbind from the cap provider |
| API add_capability | Plugin loaded, listener thread started | Subscribes to capability main topic. Queries bus for any existing bindings and re-subscribes to those topics |
| API remove_capability | Plugin unloaded, thread terminated | Unsubscribes from all cap topics |
| API set_binding | None | OP_BIND_ACTOR invoked on ALL matching caps in lattice (not random via queue subscribe) |
| API remove_binding | None | OP_REMOVE_ACTOR invoked on ALL matching caps in lattice (not random via queue subscribe) |
| Lattice schedule actor | None | Auction held, actor bytes downloaded from Gantry, actor started. No effect on bindings |
| Lattice stop actor | None | Specific host is told to terminate an actor. Identical to host's remove_actor. Only impacts a single instance of an actor |
| Lattice set binding | None | OP_BIND_ACTOR invoked on ALL matching caps in lattice. Identical to host API call |
| Lattice remove binding | None | OP_REMOVE_ACTOR invoked on ALL matching caps in lattice. Identical to host API call |
| Lattice add capability | N/A | Unsupported until Gantry supports the storage/retrieval of cap providers |
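
To illustrate the add_capability and set_binding rows above, here is a rough sketch of the bus interactions they imply (the LatticeBus trait and function names are invented for illustration; they are not the real host or bus API):

```rust
/// Illustrative view of the bus operations implied by the table above.
struct Binding {
    actor: String,
    capid: String,
    instance: String,
}

trait LatticeBus {
    /// Inventory query: bindings currently active for a capability instance.
    fn existing_bindings(&self, capid: &str, instance: &str) -> Vec<Binding>;
    /// Subscribe to the private actor<->provider subject for a binding.
    fn subscribe_binding_topic(&self, binding: &Binding);
    /// Publish OP_BIND_ACTOR so that ALL matching provider instances in the
    /// lattice receive it (plain publish, not a random queue subscriber).
    fn publish_bind_actor(&self, binding: &Binding);
}

/// add_capability: after the plugin is loaded and its listener thread is
/// started, re-subscribe to the topics of any bindings the lattice already
/// has active for this provider instance.
fn on_capability_added(bus: &dyn LatticeBus, capid: &str, instance: &str) {
    for binding in bus.existing_bindings(capid, instance) {
        bus.subscribe_binding_topic(&binding);
    }
}

/// set_binding: no host-local impact; OP_BIND_ACTOR goes to every matching
/// capability instance in the lattice.
fn on_set_binding(bus: &dyn LatticeBus, binding: &Binding) {
    bus.publish_bind_actor(binding);
}
```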

This should produce the following high-level behaviors:

  • An actor can scale above 1 and down to 1 instance without requiring a re-set of its bindings. Scale to 0 will drop all bindings between that actor and all of its providers.
  • A capability provider can scale above 1 and down to 1 instance without requiring a re-set of its bindings. Scaling it to 0 will force the lattice to "forget" all bindings between the provider and actors
  • Setting a binding must be an idempotent operation (providers cannot crash on re-set). It will be up to the individual providers whether they ignore a re-bind of a running instance containing different values or whether they accept it and reconfigure accordingly.
  • Setting and removal of bindings are done lattice-wide, applying the change to all instances of capability providers
  • The lattice does not maintain a distributed cache of bindings. Bindings are either active in the lattice, or they are not. It is the responsibility of the entity doing the scheduling to ensure that after a scale from 0 to 1, bindings are created.

Multiplicity of Bindings

In the feature described above, bindings will expand to fill the space they are given. If you are running 9 instances of a single named capability provider (e.g. wascc:http_server,default or wascc:messaging,foobar), then every bound actor group will be bound to all 9 instances of that provider. If you run 3 actor groups that all need bindings to the default message broker, each of those 3 groups will be bound to each of the 9 instances of the provider, and you cannot sub-divide by giving more or fewer instances to specific actor groups.

This can have consequences developers need to be aware of. For example, if you bind an actor group to a message broker provider that is using a straight subscription and not a queue subscription, then each message from that subscription will be delivered to a random actor within the group n times, once for each running provider instance. If you don't want duplicates, you'll need to use a queue subscription.
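
To make that concrete, here is a tiny illustration of the difference using the synchronous nats crate as a dependency (the subject and queue group names are made up; a real provider would use whatever the binding configures):

```rust
// Illustration of plain vs. queue subscriptions with the sync `nats` crate.
// Run several copies of this process to see the difference in delivery.
fn main() -> std::io::Result<()> {
    let nc = nats::connect("127.0.0.1:4222")?;

    // Plain subscription: if 4 provider instances do this, each message on
    // the subject is delivered 4 times (once per instance), so the bound
    // actor group sees duplicates.
    let _every_instance = nc.subscribe("demo.subject")?;

    // Queue subscription: the 4 instances join the same queue group, and
    // each message is delivered to exactly one of them.
    let _one_of_group = nc.queue_subscribe("demo.subject", "wascc-messaging")?;

    Ok(())
}
```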