contiv-experimental/cluster

Cannot commission node which was used to create existing cluster - 'build node' commissioning failures

rkharya opened this issue · 13 comments

Docker - 1.11
UCP - 1.1.0
Ansible/Cluster - cloned on May 16th/"v0.1-05-12-2016.08-27-16.UTC"

[cluster-admin@Docker-1 ansible]$ clusterctl nodes get | egrep "inventory_name|status"
"status": "Unallocated",
"prev_status": "Provisioning",
"inventory_name": "Docker-1-FLM19379EUC", >>>>>>> 'build node'
"status": "Allocated",
"prev_status": "Provisioning",
"inventory_name": "Docker-2-FLM19379EU8",
"status": "Allocated",
"prev_status": "Provisioning",
"inventory_name": "Docker-3-FCH19517CF9",
"status": "Allocated",
"prev_status": "Provisioning",
"inventory_name": "Docker-4-FCH19517CER",
"status": "Allocated",
"prev_status": "Provisioning",
"inventory_name": "Docker-5-FCH19517CAT",
"status": "Allocated",
"prev_status": "Provisioning",
"inventory_name": "Docker-6-FCH1945JJ4F",

Command used to commission -
clusterctl nodes commission Docker-1-FLM19379EUC --host-group=service-worker

[cluster-admin@Docker-1 ansible]$ clusterctl job get last
{
"desc": "commissionEvent: nodes:[Docker-1-FLM19379EUC] extra-vars:{} host-group:service-worker",
"task": "github.com/contiv/cluster/management/src/clusterm/manager.(commissionEvent).(github.com/contiv/cluster/management/src/clusterm/manager.configureOrCleanupOnErrorRunner)-fm",
"status": "Errored",
"error": "exit status 2",
"logs": [
"[DEPRECATION WARNING]: Instead of sudo/sudo_user, use become/become_user and ",
"make sure become_method is 'sudo' (default).",
"This feature will be removed in a ",
"future release. Deprecation warnings can be disabled by setting ",
"deprecation_warnings=False in ansible.cfg.",
"",
"PLAY [devtest] *
",
"skipping: no hosts matched",
"",
"PLAY [volplugin-test] ***
",
"skipping: no hosts matched",
"",
"PLAY [cluster-node] ***
***",
"skipping: no hosts matched",
"",
"PLAY [cluster-control] ***
",
"skipping: no hosts matched",
"",
"PLAY [service-master] ***
",
"skipping: no hosts matched",
"",
"PLAY [service-worker] ***
",
"",
"TASK [setup] ***
",
"ok: [Docker-1-FLM19379EUC]",
"",
"TASK [base : include] ***
",
"skipping: [Docker-1-FLM19379EUC]",
"",
"TASK [base : include] ***
",
"included: /home/cluster-admin/ansible/roles/base/tasks/redhat_tasks.yml for Docker-1-FLM19379EUC",
"",
"TASK [base : install epel release package (redhat)] ***
",
"ok: [Docker-1-FLM19379EUC]",
"",
"TASK [base : install/upgrade base packages (redhat)] ***
",
"skipping: [Docker-1-FLM19379EUC] =\u003e (item=[u'ntp', u'unzip', u'bzip2', u'curl', u'python-requests', u'bash-completion', u'kernel', u'libselinux-python']) ",
"",
"TASK [base : install and start ntp] ***
",
"ok: [Docker-1-FLM19379EUC]",
"",
"TASK [base : include] ***
",
"included: /home/cluster-admin/ansible/roles/base/tasks/os_agnostic_tasks.yml for Docker-1-FLM19379EUC",
"",
"TASK [base : download consul binary] ***
",
"ok: [Docker-1-FLM19379EUC]",
"",
"TASK [base : install consul] ***
**********************************************",
"changed: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : check docker version] ***
",
"changed: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : install docker (debian)] ***
",
"skipping: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : remove docker (redhat)] ***
",
"skipping: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : install docker (redhat)] ***
",
"skipping: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : create docker daemon's config directory] ***
",
"ok: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : setup docker daemon's environment] ***
",
"changed: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : setup iptables for docker] ***
**",
"changed: [Docker-1-FLM19379EUC] =\u003e (item=2385)",
"",
"TASK [docker : copy systemd units for docker(enable cluster store) (debian)] *",
"skipping: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : copy systemd units for docker(enable cluster store) (redhat)] *
",
"changed: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : check docker-tcp socket state] *
",
"changed: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : include] ***
",
"included: /home/cluster-admin/ansible/roles/docker/tasks/create_docker_device.yml for Docker-1-FLM19379EUC",
"",
"TASK [docker : pvcreate check for /dev/sdb] ***
",
"fatal: [Docker-1-FLM19379EUC]: FAILED! =\u003e {"changed": true, "cmd": "pvdisplay /dev/sdb", "delta": "0:00:00.107333", "end": "2016-05-18 15:55:35.001004", "failed": true, "rc": 5, "start": "2016-05-18 15:55:34.893671", "stderr": " Failed to find physical volume \"/dev/sdb\".", "stdout": "", "stdout_lines": [], "warnings": []}",
"...ignoring",
"",
"TASK [docker : pvcreate /dev/sdb] ***
",
"changed: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : vgcreate check for /dev/sdb] ***
",
"fatal: [Docker-1-FLM19379EUC]: FAILED! =\u003e {"changed": true, "cmd": "vgdisplay contiv", "delta": "0:00:00.009131", "end": "2016-05-18 15:55:35.439154", "failed": true, "rc": 5, "start": "2016-05-18 15:55:35.430023", "stderr": " Volume group \"contiv\" not found\n Cannot process volume group contiv", "stdout": "", "stdout_lines": [], "warnings": []}",
"...ignoring",
"",
"TASK [docker : vgcreate contiv] ***
",
"changed: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : lvcreate check for /dev/sdb] ***
*******************",
"fatal: [Docker-1-FLM19379EUC]: FAILED! =\u003e {"changed": true, "cmd": "lvdisplay contiv | grep -q dockerthin", "delta": "0:00:00.009585", "end": "2016-05-18 15:55:35.842359", "failed": true, "rc": 1, "start": "2016-05-18 15:55:35.832774", "stderr": "", "stdout": "", "stdout_lines": [], "warnings": []}",
"...ignoring",
"",
"TASK [docker : lvcreate contiv-dockerthin] ***
",
"changed: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : copy systemd units for docker tcp socket settings] ***
",
"ok: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : reload systemd configuration] ***
",
"skipping: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : stop docker] ***
",
"skipping: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : start docker-tcp service] ***
****",
"skipping: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : check docker service state] ***
_",
"changed: [Docker-1-FLM19379EUC]",
" [WARNING]: Consider using 'become', 'become_method', and 'become_user' rather",
"than running sudo",
"",
"TASK [docker : remove the docker key file, if any. It shall be regenerated by docker on restart] __",
"changed: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : reload docker systemd configuration] _
",
"changed: [Docker-1-FLM19379EUC]",
"",
"TASK [docker : restart docker (first time)] ***
",
"fatal: [Docker-1-FLM19379EUC]: FAILED! =\u003e {"changed": false, "failed": true, "msg": "Warning: Stopping docker.service, but it can still be activated by:\n docker-tcp.socket\nJob for docker.service failed because the control process exited with error code. See \"systemctl status docker.service\" and \"journalctl -xe\" for details.\n"}",
"...ignoring",
"",
"TASK [docker : ensure docker is started] ***
**************************",
"fatal: [Docker-1-FLM19379EUC]: FAILED! =\u003e {"changed": false, "failed": true, "msg": "Job for docker.service failed because the control process exited with error code. See \"systemctl status docker.service\" and \"journalctl -xe\" for details.\n"}",
"",
"NO MORE HOSTS LEFT ***
",
"\tto retry, use: --limit @/home/cluster-admin/ansible/site.retry",
"",
"PLAY RECAP ***
****************************",
"Docker-1-FLM19379EUC : ok=26 changed=12 unreachable=0 failed=1 ",
"",
"[DEPRECATION WARNING]: Instead of sudo/sudo_user, use become/become_user and ",

@rkharya

can you check systemctl status docker and see if it is hitting the following failure?

unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration file: cluster-store: (from flag: etcd://localhost:2379, from file: etcd://172.28.225.130:12379)

If so, then try commissioning the node again and it should work. Basically UCP 1.1.0 creates a docker configuration file which causes this error. Please let me know if this is not the case.
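A rough way to confirm and clear that conflict, assuming the duplicate cluster-store setting is coming from /etc/docker/daemon.json (the commands below are a sketch, not a verified fix):

# look for "specified both as a flag and in the configuration file" in the daemon error
sudo systemctl status -l docker
# check whether cluster-store is set here as well as in the unit's ExecStart
sudo cat /etc/docker/daemon.json
# if it is set in both places, drop it from one of them, then:
sudo systemctl daemon-reload
sudo systemctl restart docker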

It doesn't look to be the UCP issue. Docker couldn't start due to a devicemapper driver failure -

[cluster-admin@Docker-1 ~]$ sudo systemctl status -l docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/docker.service.d
└─env.conf
Active: failed (Result: exit-code) since Wed 2016-05-18 15:55:46 IST; 2 days ago
Docs: https://docs.docker.com
Process: 16732 ExecStart=/usr/bin/docker daemon --storage-opt dm.thinpooldev=/dev/mapper/contiv-dockerthin -H fd:// --cluster-store=etcd://localhost:2379 (code=exited, status=1/FAILURE)
Main PID: 16732 (code=exited, status=1/FAILURE)

May 18 15:55:45 Docker-1.cisco.com systemd[1]: Starting Docker Application Container Engine...
May 18 15:55:45 Docker-1.cisco.com docker[16732]: time="2016-05-18T15:55:45.580173555+05:30" level=info msg="New containerd process, pid: 16738\n"
May 18 15:55:46 Docker-1.cisco.com docker[16732]: time="2016-05-18T15:55:46.608043629+05:30" level=error msg="[graphdriver] prior storage driver "devicemapper" failed: devmapper: Base Device UUID and Filesystem verification failed.devicemapper: Error running deviceCreate (ActivateDevice) dm_task_run failed"
May 18 15:55:46 Docker-1.cisco.com docker[16732]: time="2016-05-18T15:55:46.608162237+05:30" level=fatal msg="Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed.devicemapper: Error running deviceCreate (ActivateDevice) dm_task_run failed"
May 18 15:55:46 Docker-1.cisco.com systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
May 18 15:55:46 Docker-1.cisco.com docker[16732]: time="2016-05-18T15:55:46+05:30" level=info msg="stopping containerd after receiving terminated"
May 18 15:55:46 Docker-1.cisco.com systemd[1]: Failed to start Docker Application Container Engine.
May 18 15:55:46 Docker-1.cisco.com systemd[1]: Unit docker.service entered failed state.
May 18 15:55:46 Docker-1.cisco.com systemd[1]: docker.service failed.
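In case it helps, a few generic commands for poking at the devicemapper state behind an error like this (nothing clusterm-specific, purely diagnostic; the metadata path is the usual location for the devicemapper graphdriver):

sudo journalctl -u docker -n 50 --no-pager       # full daemon start-up errors
sudo lvs                                         # is the contiv/dockerthin thin pool present and active?
sudo dmsetup status                              # device-mapper targets as the kernel sees them
sudo ls /var/lib/docker/devicemapper/metadata/   # metadata left behind by a previous devicemapper run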

hmmm, okay. I haven't seen this before. Was this node commissioned before? I am wondering if docker is running into trouble due to some previous state.

This node was always used for pulling the ansible code and doing the discovery & commissioning of the other nodes. Since we were hitting issue #35, I was waiting for that to be resolved before commissioning this node as a service-worker. Now that #35 is fixed, this should also work. Docker runs cleanly on this node, as I see the collins container coming up...

@rkharya

Now that #35 is fixed.. this should also work. Docker runs cleanly on this node as i see collins container coming up...

sorry I might be missing something. Are you confirming that commission of this node is working for you now? Or are you still seeing commission failure?

@rkharya

Now that #35 is fixed.. this should also work. Docker runs cleanly on this node as i see collins container coming up..
sorry I might be missing something. Are you confirming that commission of this node is working for you now? Or are you still seeing commission failure?

No, commissioning is failing. I meant to say that with issue #35 fixed, first node commissioning should also work, but it's not.

@rkharya thanks for clarifying.

Can you try the fix for docker recommended here: moby/moby#6325 (comment)
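For reference, the workaround in that comment essentially boils down to wiping docker's local state before restarting; a sketch of it (note this removes all local images, containers and volumes, so it is only reasonable on a node that is being re-commissioned anyway):

sudo systemctl stop docker
sudo rm -rf /var/lib/docker     # destroys all local docker state
sudo systemctl start docker     # or reboot the node, as was done below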

@mapuri

After removing /var/lib/docker and rebooting the node, commissioning went through successfully. There are a few observations here -

  1. The collins container is no longer running on node-1, obviously since we cleaned up /var/lib/docker. Will this create further issues when we do node life-cycle management? Or if we want to add new nodes to the existing cluster, will that work without any issues?
  2. How do we handle this hack, which has to be executed manually outside the ansible playbook? Do we need to add it as an additional task?
  3. This problem seems to originate with the devicemapper driver: when we run the ansible playbook for the first time on node-1 to install clusterm/serf etc. and docker itself, we do not install docker with 'docker_device', while during node commissioning we do rely on docker_device (see the sketch below). That is my guess, though.
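To illustrate point 3, the two docker invocations differ roughly like this (the second line is the ExecStart from the systemd status output above; the first one is assumed, not taken from the playbook):

# initial bring-up, no docker_device: default storage driver, state written under /var/lib/docker (assumed)
/usr/bin/docker daemon -H fd:// --cluster-store=etcd://localhost:2379
# after commissioning with docker_device: same /var/lib/docker, but a different storage backend
/usr/bin/docker daemon --storage-opt dm.thinpooldev=/dev/mapper/contiv-dockerthin -H fd:// --cluster-store=etcd://localhost:2379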

@rkharya

Will this create further issues when we do node life-cycle management? Or if we want to add new nodes to the existing cluster, will that work without any issues?

Yes, if we are using collins as the inventory subsystem then you are correct. We don't have this problem with the default settings though, as we use boltdb for inventory. collins is just there for tests at the moment and we need to think more about how to bring it into production.

How do we handle this hack, which has to be executed manually outside the ansible playbook? Do we need to add it as an additional task?

I am not sure if we can/should handle it in clusterm or ansible. It looks like a docker issue, right? I think docker should be able to handle users restarting docker after setting a different storage device. At the very least a reboot should not have been required; shall we check with the docker folks?

vvb commented

@rkharya a change in the storage driver or a change to docker_device requires manually removing the existing /var/lib/docker volume... @mapuri is correct in saying that we should not handle this in ansible/clusterm.

@mapuri @vvb
Agreed on this. How should we take this to Docker? I am not sure about the process. I can talk to Uday about this. Let me know if we need to open an issue with them.

@rkharya

it would be nice to narrow down the reproduction steps to just docker, i.e. something like:

  • install docker
  • setup storage device
  • restart docker

I think we can then share it with Uday or just file an issue on the docker GitHub repo.
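Roughly along these lines, perhaps (device name, install method and pool sizing are placeholders, not the exact playbook steps):

# 1. install docker and let it populate /var/lib/docker with the default storage setup
curl -fsSL https://get.docker.com | sudo sh
sudo systemctl start docker
sudo docker pull busybox

# 2. carve a devicemapper thin pool out of a spare disk (e.g. /dev/sdb)
sudo pvcreate /dev/sdb
sudo vgcreate contiv /dev/sdb
sudo lvcreate -l 90%FREE --thinpool dockerthin contiv
# point the daemon at it: --storage-opt dm.thinpooldev=/dev/mapper/contiv-dockerthin

# 3. restart docker without clearing /var/lib/docker and check whether it fails the same way
sudo systemctl restart docker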

@rkharya I will close this as it doesn't seem to be a cluster or ansible issue per se, but feel free to reopen or file a new issue if needed based on the follow-up with the docker folks.