contiv-experimental/cluster

cluster manager should try to perform a cleanup in case of provisioning failure

Closed this issue · 4 comments

If Ansible fails halfway through provisioning, it might be better to perform a cleanup. This will ensure that services don't come up in a partial state.

Design considerations:

  • should this always be done, or should it be configurable?

Here are a few considerations for addressing this issue:

  1. Create a prechecks role [1] that runs before the actual node playbook. Precheck tasks would check for the availability of network ports, required packages, services, other requirements [2], etc.
  2. Roles should provide a basic level of functional/unit testing, such as [3], to mitigate services coming up in a partial state. Sanity-check tasks could be enabled/disabled on a per-service basis (e.g. `enable_test_ceph | bool`) and on a deployment-wide basis (`enable_deployment_tests | bool`). For example, `enable_deployment_tests: true` for a k8s deployment would run a simple manifest that deploys a web pod and web service, then curls both endpoints.
  3. Trigger a clean-up playbook run such as [4] when a provisioning failure event is detected during the commissioning process.
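A prechecks role along the lines of [1] could be sketched as below. This is a hypothetical example: the role path, port numbers, and package names are placeholders for illustration, not Contiv's actual requirements.

```yaml
# roles/prechecks/tasks/main.yml (hypothetical sketch)
# Fail early, before any service is touched, if basic requirements are not met.

- name: Check that required network ports are not already in use
  wait_for:
    port: "{{ item }}"
    state: stopped     # fails after the timeout if something is listening
    timeout: 1
  with_items:
    - 6640             # placeholder port
    - 9001             # placeholder port

- name: Check that required packages are installed
  command: rpm -q {{ item }}
  register: pkg_check
  failed_when: pkg_check.rc != 0
  changed_when: false
  with_items:
    - docker           # placeholder package
```

Because these tasks run before any state-changing role, a failure here aborts the play without leaving anything behind to clean up.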

[1] https://github.com/openstack/kolla/tree/master/ansible/roles/prechecks
[2] https://github.com/openstack/openstack-ansible/blob/master/tests/roles/bootstrap-host/tasks/check-requirements.yml
[3] https://github.com/openstack/kolla/blob/master/ansible/roles/keystone/tasks/check.yml
[4] https://github.com/kubernetes/contrib/blob/master/ansible/playbooks/adhoc/uninstall.yml

@danehans

Yes, these are all good points. Actually, I have something similar to 3. in mind; contiv/ansible already has a cleanup.yml playbook that takes care of cleaning up services, and I am planning to just use that.
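Within Ansible itself, the "clean up on provisioning failure" pattern from point 3 can be expressed with block/rescue. The file names below are hypothetical placeholders (the actual cleanup.yml in contiv/ansible is a standalone playbook, so this is only a sketch of the pattern, not a drop-in change):

```yaml
# site.yml (hypothetical sketch): fall back to cleanup on any provisioning failure
- hosts: netplugin-nodes
  tasks:
    - block:
        - include_tasks: provision_tasks.yml   # hypothetical provisioning task file
      rescue:
        - name: Provisioning failed, run cleanup tasks
          include_tasks: cleanup_tasks.yml     # hypothetical cleanup task file
        - name: Re-raise the failure after cleanup
          fail:
            msg: "Provisioning failed on {{ inventory_hostname }}; cleanup was attempted."
```

The rescue section runs only when a task in the block fails, and the final `fail` ensures the run is still reported as a failure to the caller (e.g. the cluster manager) rather than silently succeeding after cleanup.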

I like the idea of a prechecks role as suggested in 1. (I think it is very close to what I have been tracking here: contiv/ansible#87, but haven't been able to spend time on it). One of the biggest gains with prechecks is that they help us fail early, so we often needn't do an expensive cleanup if the prechecks fail.

2. is more useful with respect to testing the Ansible roles themselves. Until now we have depended on the roles being exercised in different projects as a way to test them.

With respect to this issue, I think we can address the failure handling in phases, i.e. we can start with 3., and as 1. gets added to ansible we can start making use of it in cluster as well without changing the user experience.

@mapuri thanks for the feedback. I am getting ready to work on cluster in vagrant. Hopefully I can help you with contiv/ansible#87