contiv-experimental/cluster

Proposal: multi-node operation support

Closed this issue · 3 comments

Addresses #86

Right now clusterm handles requests and events for just one node at a time. This proposal details requirements and changes for adding multi-node operation support.

Desired Behavior

Multi-node support is needed in two situations: user requests and monitor events.

  • user requests: these are REST endpoints that a clusterm client uses to perform node-level operations. The following operations/endpoints are required:
    • info/nodes:
      • This endpoint exists today and returns the info of all the nodes in the inventory.
      • This endpoint shall be extended to take the names of one or more nodes and return info for the specified subset.
      • A node that doesn't exist is silently ignored.
    • commission/nodes, decommission/nodes, maintenance/nodes:
      • these are new endpoints that shall commission, decommission or upgrade, respectively, a subset of nodes
      • these endpoints shall take a list of node names
      • if one or more nodes don't exist, the entire request shall fail
      • the specified operation is performed on all nodes together, with the following behavior on failure:
        • commission: tries to clean up all nodes in case of provision failure. The user can then rectify the failure and re-post the same request.
        • decommission: cleanup never fails and is best effort
        • maintenance: the correct desired behavior is yet to be defined. In the current implementation the nodes are transitioned to the unallocated state but no other action is taken.
    • discover/nodes:
      • this is a new endpoint that provisions a set of nodes for discovery
      • this endpoint shall take a list of node addresses
      • if provisioning fails for one or more nodes, the user will need to correct the failure and post this request again for the failed nodes
  • monitor events: these are the node-level events generated by the monitor subsystem
    • discovered: this updates the inventory state of a node; no provision action is associated with this event yet.
      • this event shall be changed to accept a list of nodes
    • disappeared: this updates the inventory state of a node; no provision action is associated with this event yet.
      • this event shall be changed to accept a list of nodes
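
The difference between the lenient lookup of info/nodes and the all-or-nothing validation of commission/decommission/maintenance can be sketched as below. This is only an illustration of the rules listed above, not clusterm code: the `Inventory` type, its method names and the state strings are hypothetical, and the empty-list rejection follows the REST response codes given later in this proposal.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Inventory is a hypothetical stand-in for clusterm's node inventory,
// mapping a node name to its current state.
type Inventory map[string]string

// InfoFor follows the proposed info/nodes rule: an empty name list selects
// every node, and names missing from the inventory are silently skipped.
func (inv Inventory) InfoFor(names []string) map[string]string {
	out := make(map[string]string)
	if len(names) == 0 {
		for name, state := range inv {
			out[name] = state
		}
		return out
	}
	for _, name := range names {
		if state, ok := inv[name]; ok {
			out[name] = state
		}
	}
	return out
}

// ValidateAll follows the proposed commission/decommission/maintenance rule:
// the whole request fails if the list is empty or any name is unknown.
func (inv Inventory) ValidateAll(names []string) error {
	if len(names) == 0 {
		return fmt.Errorf("empty node list")
	}
	var missing []string
	for _, name := range names {
		if _, ok := inv[name]; !ok {
			missing = append(missing, name)
		}
	}
	if len(missing) > 0 {
		sort.Strings(missing)
		return fmt.Errorf("unknown nodes: %s", strings.Join(missing, ", "))
	}
	return nil
}

func main() {
	// Node names and states here are made up for the example.
	inv := Inventory{"node1": "Unallocated", "node2": "Allocated"}
	fmt.Println(inv.InfoFor([]string{"node1", "ghost"}))
	fmt.Println(inv.ValidateAll([]string{"node1", "ghost"}))
}
```

Validating the full list before starting any work is what lets the request fail as a whole without leaving some nodes half-provisioned.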

UX considerations

  • REST API:
    • GET info/nodes
      • request body: { nodes: [] }
        • empty list shall return info about all nodes
      • response body: no change
      • response codes: no change
    • POST commission/nodes
      • request body: { nodes: [] }
      • response body: empty
      • response codes:
        • 200:
          • no errors and ansible run started successfully
        • 500:
          • backend validation failure, OR
          • empty list or one or more non-existent nodes were specified
    • POST decommission/nodes
      • request body: { nodes: [] }
      • response body: empty
      • response codes:
        • 200:
          • no errors and ansible run started successfully
        • 500:
          • backend validation failure, OR
          • empty list or one or more non-existent nodes were specified
    • POST maintenance/nodes
      • request body: { nodes: [] }
      • response body: empty
      • response codes:
        • 200:
          • no errors and ansible run started successfully
        • 500:
          • backend validation failure, OR
          • empty list or one or more non-existent nodes were specified
    • POST discover/nodes
      • request body: { addresses: [] }
        • an empty list, or a list containing one or more invalid addresses, is an error
      • response body: empty
      • response codes:
        • 200:
          • no errors and ansible run started successfully
        • 500:
          • backend validation failure, OR
          • empty list or one or more invalid addresses were specified
  • CLI:
    • get nodes info:
      • clusterctl nodes get [< node-name(s) >]
    • commission nodes:
      • clusterctl nodes commission [--extra-vars=< extra-vars >] < node-name(s) >
    • decommission nodes:
      • clusterctl nodes decommission [--extra-vars=< extra-vars >] < node-name(s) >
    • upgrade nodes:
      • clusterctl nodes maintenance [--extra-vars=< extra-vars >] < node-name(s) >

System test considerations

  • test the success scenario with multiple nodes
  • test the 500 error condition paths
  • test cleanup on failure of one or more nodes

/cc @vvb for proposal review

/cc @vishal-j for UX

What do you think about passing ansible variables as JSON in the request body for commissioning and discovery?

hmmm, yeah I think we can add them to the body as well.

Till now I had kept them as query parameters as they are optional, i.e. an empty value of extra-vars still implies a valid value, while extra-vars being absent implies don't do anything. extra-vars are merged (i.e. global level and per-request level), so this could make a subtle difference in the logic implementation, which I need to re-verify. And till now we didn't have a precedent for a request body.

I need to think more but most likely we can get the current behavior with extra-vars in request-body as well. This will keep the requests consistent, which is good.
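
A minimal sketch of why the absent-versus-empty distinction can survive a move to the request body: Go's encoding/json leaves a map field nil when its key is missing and sets it to a non-nil empty map when the body carries `{}`. The `extra_vars` field name, the `commissionRequest` type and the rule that per-request values win on conflict are all assumptions for illustration; the discussion above only says the two levels are merged.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// commissionRequest is a hypothetical request body carrying extra-vars
// alongside the node list; the "extra_vars" key is an assumption.
type commissionRequest struct {
	Nodes     []string               `json:"nodes"`
	ExtraVars map[string]interface{} `json:"extra_vars"`
}

// parse decodes a request body. ExtraVars stays nil when the key is absent
// and becomes a non-nil empty map when the body carries "extra_vars": {}.
func parse(body string) (commissionRequest, error) {
	var req commissionRequest
	err := json.Unmarshal([]byte(body), &req)
	return req, err
}

// mergeVars overlays per-request vars on the global ones. Letting the
// request level win on conflict is an assumption, not stated in the thread.
func mergeVars(global, request map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(global)+len(request))
	for k, v := range global {
		out[k] = v
	}
	for k, v := range request {
		out[k] = v
	}
	return out
}

func main() {
	// The global variable below is made up for the example.
	global := map[string]interface{}{"env": "prod"}
	bodies := []string{
		`{"nodes":["node1"]}`,
		`{"nodes":["node1"],"extra_vars":{}}`,
	}
	for _, body := range bodies {
		req, err := parse(body)
		if err != nil {
			fmt.Println("bad request:", err)
			continue
		}
		fmt.Println("extra-vars present:", req.ExtraVars != nil,
			"merged:", mergeVars(global, req.ExtraVars))
	}
}
```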

Since this will also affect the existing single-node APIs, let me track this as a separate issue so we can bring this change in for all APIs together. Does this work?

Yes. Thanks.