contiv-experimental/cluster

what should happen when a node disappears and appears again?

mapuri opened this issue · 0 comments

Right now, when a node disappears and appears again, clusterm updates its state in the inventory; however, no configuration-related action is taken on the node. This is not always desirable: when a node reappears, it may need its services set up again.

This issue explores a few options to handle these scenarios. Feel free to pitch in with comments and feedback.

The following are some scenarios in which a node may disappear, along with what may happen or be desired when it comes back:

  • Scenario 1: the node loses network connectivity (i.e. control-interface traffic is affected)
    • since the node itself is up, all services will keep running, but peer services will most likely readjust and treat the host as down.
    • when the node appears again, if nothing has changed then it will get added back to the cluster.
    • however, it is not guaranteed that the service configuration has not changed in the meantime.
    • some sort of config-checksum equivalent could perhaps help here (a minimal sketch appears after this list).
    • a re-setup of services (after the config check) will ensure that the node is on-boarded back with the correct config.
  • Scenario 2: the node is rebooted (due to power loss or admin action)
    • the services are stopped on the node.
    • when the node boots up again, the services need to be set up again.
    • a re-setup of services will ensure that the node is on-boarded back with the correct config.
  • Scenario 3: only serf is affected (due to a crash, bug, or admin action):
    • first of all, this is a service failure, and we may be better off debugging serf itself.
    • since the node is up, all services will keep running and continue to be part of the cluster.
    • when serf recovers, most likely no configuration action is required.
    • a config-checksum equivalent could perhaps help here as well.
    • a re-setup of services is most likely not needed, but it will ensure that the node is on-boarded back with the correct config.
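
For reference, here is a minimal sketch of what a config-checksum style check could look like as an ansible task. The config path and the expected_etcd_conf_sha1 variable (which clusterm or the inventory would need to record at provisioning time) are placeholders for illustration:

```yaml
# Sketch only: the config path and expected_etcd_conf_sha1 are hypothetical.
- name: compute checksum of the etcd config currently on the node
  stat:
    path: /etc/etcd/etcd.conf
    checksum_algorithm: sha1
  register: etcd_conf

- name: flag the node for service re-setup if the config has drifted
  set_fact:
    config_check_failed: true
  when: etcd_conf.stat.checksum | default('') != expected_etcd_conf_sha1
```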

Note: there is also a time window (especially in Scenario 3 above) when the node is marked down in the monitoring system but is still reachable from a service perspective. We will ignore this case for now, as the condition that caused it would likely need to be debugged and fixed instead. The side effect also doesn't seem too bad, unless a service configuration change is desired during this window.

What follows is a high-level proposal for possible ansible and clusterm enhancements to address the node disappearance and reappearance scenarios:

Configuration check (ansible tasks and plays):

  • We could add per-service config-check tasks that verify the service is configured in the desired fashion. Some examples of this (see also the sketch after this list):
    • etcd:
      • a master node could check that the etcd service is running in master mode
      • a worker node could check that the etcd service is running in proxy mode
      • etcd ports are up to date ... and so on
    • contiv_network:
      • a master node shall be running the netmaster service in addition to netplugin
      • a worker node shall be running just netplugin
      • netplugin/master ports are up to date
    • if the checks fail, then a service re-setup shall be triggered (see next section)
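
To make this concrete, below is a rough sketch of what a couple of such config-check tasks might look like. The node_role variable, file paths, and unit names are assumptions for illustration and not the actual contiv ansible roles; each task simply fails when the node is not configured as expected:

```yaml
# Hypothetical config-check tasks (sketch only); node_role and the
# paths/unit names being checked are assumptions, not actual contiv roles.
- name: check that etcd runs in proxy mode on worker nodes
  command: grep -q "^ETCD_PROXY=on" /etc/etcd/etcd.conf
  changed_when: false
  when: node_role == "worker"

- name: check that the netmaster service is active on master nodes
  command: systemctl is-active netmaster
  changed_when: false
  when: node_role == "master"

- name: check that the netplugin service is active on all nodes
  command: systemctl is-active netplugin
  changed_when: false
```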

Service Re-setup (clusterm discovered event handler):

  • a service re-setup is triggered when a configuration check fails
    • a re-setup would involve running cleanup.yml followed by regular provisioning based on the node's host-group (a rough sketch follows below).
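
A rough sketch of the re-setup sequence, expressed as a playbook, is below. cleanup.yml is the playbook named above; site.yml is just a stand-in for whatever playbook does the regular host-group based provisioning. clusterm would run this limited to the node whose configuration check failed:

```yaml
# Sketch of a re-setup playbook for a reappearing node. cleanup.yml is named
# in the proposal; site.yml is a stand-in for the regular provisioning playbook.
- import_playbook: cleanup.yml
- import_playbook: site.yml
```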