Allow both Clients and Services to work completely offline for a period of time.
erikstmartin opened this issue · 0 comments
In the event of an entire doozer outage (the whole cluster is down), we would still be able to run in an offline mode because we maintain internal lists of services.
Services:
On the service side, when we notice there are no more doozer instances to try, we mark ourselves as offline and stop sending updates to doozer. If we unregister, we start hard rejecting traffic.
We retry connecting to doozer at a set interval, and when doozer comes back online we re-register ourselves.
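
The service-side behaviour could be sketched roughly as below. This is only an illustration, not skynet's or doozer's actual API: `Service`, `UpdateFunc`, `ErrUnavailable`, and every method name are hypothetical, and `UpdateFunc` stands in for whatever call writes this service's state to doozer. The sketch also assumes that "re-register" means republishing the service's current state once doozer is reachable again.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// ErrUnavailable is a hypothetical error meaning no doozer instance was reachable.
var ErrUnavailable = errors.New("no doozer instance reachable")

// UpdateFunc is a hypothetical stand-in for writing this service's state to doozer.
type UpdateFunc func(addr string, registered bool) error

// Service tracks whether it is registered and whether doozer is reachable.
type Service struct {
	mu         sync.Mutex
	addr       string
	registered bool
	offline    bool
	update     UpdateFunc
}

// NewService returns a service that wants to be registered at addr.
func NewService(addr string, update UpdateFunc) *Service {
	return &Service{addr: addr, registered: true, update: update}
}

// Publish sends the current state to doozer. While offline it sends nothing;
// a failed update switches the service into offline mode.
func (s *Service) Publish() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.offline {
		return
	}
	if err := s.update(s.addr, s.registered); err != nil {
		s.offline = true
	}
}

// Unregister takes effect locally right away, even if doozer is down.
func (s *Service) Unregister() {
	s.mu.Lock()
	s.registered = false
	s.mu.Unlock()
	s.Publish()
}

// Accept reports whether traffic should be served; false means hard reject.
func (s *Service) Accept() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.registered
}

// Offline reports whether the service currently considers doozer unreachable.
func (s *Service) Offline() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.offline
}

// Reconnect makes one retry attempt. On success it leaves offline mode,
// having republished the current state.
func (s *Service) Reconnect() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.offline {
		return true
	}
	if err := s.update(s.addr, s.registered); err != nil {
		return false
	}
	s.offline = false
	return true
}

// RetryLoop calls Reconnect at a fixed interval until stop is closed.
func (s *Service) RetryLoop(interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			s.Reconnect()
		case <-stop:
			return
		}
	}
}
```

A caller would run `go svc.RetryLoop(5*time.Second, stop)` in the background and check `svc.Accept()` before handling each request. The 5 second interval is an arbitrary example value.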
Client:
Clients have a list of services, so they can keep using their existing pool of connections when they notice they have lost all connectivity to doozer. After X failed attempts to a given host:port, the client manually removes that instance from its pool and from the internal instance list, so that no new connections are opened to it.
We retry connecting to doozer at a set interval. Upon reconnecting, we rebuild our internal instance list from scratch and clean up any pools we have open to instances that are no longer in doozer or have unregistered themselves.
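
A minimal sketch of the client-side bookkeeping, intended for the period while doozer is unreachable, might look like this. All names here (`Client`, `ReportFailure`, `Rebuild`) are hypothetical and not part of skynet. Pools are reduced to a set of addresses for brevity, and the sketch counts total failed attempts per instance, which is one possible reading of "X failed attempts".

```go
package main

import "sort"

// Client holds the local view of service instances used while doozer is down.
type Client struct {
	maxFailures int
	failures    map[string]int
	instances   map[string]bool // internal instance list, keyed by host:port
	pools       map[string]bool // addresses that currently have an open pool
}

// NewClient starts with a known instance list and a pool per instance.
func NewClient(maxFailures int, addrs []string) *Client {
	c := &Client{
		maxFailures: maxFailures,
		failures:    make(map[string]int),
		instances:   make(map[string]bool),
		pools:       make(map[string]bool),
	}
	for _, a := range addrs {
		c.instances[a] = true
		c.pools[a] = true
	}
	return c
}

// Has reports whether addr is in the internal instance list.
func (c *Client) Has(addr string) bool { return c.instances[addr] }

// ReportFailure records a failed attempt against addr. Once maxFailures is
// reached, the instance is dropped from the pool set and the instance list,
// and true is returned.
func (c *Client) ReportFailure(addr string) bool {
	if !c.instances[addr] {
		return false
	}
	c.failures[addr]++
	if c.failures[addr] < c.maxFailures {
		return false
	}
	delete(c.instances, addr)
	delete(c.pools, addr)
	delete(c.failures, addr)
	return true
}

// Rebuild replaces the instance list with what doozer reports after a
// reconnect and returns the sorted addresses whose pools were cleaned up.
func (c *Client) Rebuild(registered []string) []string {
	fresh := make(map[string]bool, len(registered))
	for _, a := range registered {
		fresh[a] = true
	}
	var closed []string
	for a := range c.pools {
		if !fresh[a] {
			delete(c.pools, a)
			closed = append(closed, a)
		}
	}
	sort.Strings(closed)
	c.instances = fresh
	c.failures = make(map[string]int)
	return closed
}
```

An instance that was evicted locally reappears after `Rebuild` if doozer still lists it, which matches rebuilding the list from scratch rather than merging it with the offline state.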
The important thing to note here is that when doozer comes back online, any of our wait() calls and similar operations must get a new revision, because if all nodes went down the revision count will start over.
This isn't a huge priority, but I think it would be a cool thing to do at some point to further the concept that skynet is built around: Everything dies.
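
To illustrate the revision concern, here is a small sketch of a tracker that discards its remembered revision on reconnect. `RevTracker` and its methods are hypothetical, and the sketch assumes revisions are plain increasing integers and that the current revision is fetched from doozer right after reconnecting.

```go
package main

// RevTracker remembers the last revision seen, so the next wait call knows
// where to start from.
type RevTracker struct {
	last int64
}

// Observe records an event revision, ignoring anything older than what
// has already been seen.
func (r *RevTracker) Observe(rev int64) {
	if rev > r.last {
		r.last = rev
	}
}

// WaitFrom returns the revision the next wait call should start at.
func (r *RevTracker) WaitFrom() int64 { return r.last + 1 }

// Resync is called after reconnecting, with the cluster's current revision.
// It always adopts the fresh value and reports whether the counter moved
// backwards, which is what happens if the whole cluster restarted.
func (r *RevTracker) Resync(current int64) bool {
	wentBack := current < r.last
	r.last = current
	return wentBack
}
```

Without the resync step, a watcher that last saw revision 101 would keep waiting from 102 against a cluster that has restarted its count, and could miss every change until the counter caught up.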