contiv-experimental/cluster

UCP services fails to start on replicas

rkharya opened this issue · 10 comments

Docker - 1.11.1-cs2
UCP - 1.1.0
Contiv_Cluster - v0.1-05-27-2016.08-26-41.UTC

3 node master setup Docker-2/3/4 got commissioned successfully. UCP master controller came up on Docker-2 successfully.

[cluster-admin@Docker-2 ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
00c78e666df3 skynetservices/skydns:latest "/skydns" 3 hours ago Up 3 hours 53/tcp, 53/udp defaultdns
84aaf2fbb747 docker/ucp-controller:1.1.0 "/bin/controller serv" 3 hours ago Up 3 hours 0.0.0.0:443->8080/tcp ucp-controller
240f2d9c055b docker/ucp-auth:1.1.0 "/usr/local/bin/enzi " 3 hours ago Up 3 hours 0.0.0.0:12386->4443/tcp ucp-auth-worker
950737ab3d03 docker/ucp-auth:1.1.0 "/usr/local/bin/enzi " 3 hours ago Up 3 hours 0.0.0.0:12385->4443/tcp ucp-auth-api
0a917277e78c docker/ucp-auth-store:1.1.0 "/usr/local/bin/rethi" 3 hours ago Up 3 hours 0.0.0.0:12383-12384->12383-12384/tcp ucp-auth-store
1c1b21800c06 docker/ucp-cfssl:1.1.0 "/bin/cfssl serve -ad" 3 hours ago Up 3 hours 8888/tcp, 0.0.0.0:12381->12381/tcp ucp-cluster-root-ca
057843d79edf docker/ucp-cfssl:1.1.0 "/bin/cfssl serve -ad" 3 hours ago Up 3 hours 8888/tcp, 0.0.0.0:12382->12382/tcp ucp-client-root-ca
6029ace03cad docker/ucp-swarm:1.1.0 "/swarm manage --tlsv" 3 hours ago Up 3 hours 0.0.0.0:2376->2375/tcp ucp-swarm-manager
986415f909ae docker/ucp-swarm:1.1.0 "/swarm join --discov" 3 hours ago Up 3 hours 2375/tcp ucp-swarm-join
14f6374efa0b docker/ucp-proxy:1.1.0 "/bin/run" 3 hours ago Up 3 hours 0.0.0.0:12376->2376/tcp ucp-proxy
58059b6fe7c0 docker/ucp-etcd:1.1.0 "/bin/etcd --data-dir" 3 hours ago Up 3 hours 2380/tcp, 4001/tcp, 7001/tcp, 0.0.0.0:12380->12380/tcp, 0.0.0.0:12379->2379/tcp ucp-kv
5bff22bddf90 quay.io/coreos/etcd:v2.3.1 "/etcd" 3 hours ago Up 3 hours etcd

On replica nodes -

[cluster-admin@Docker-3 ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
91afc59d3874 quay.io/coreos/etcd:v2.3.1 "/etcd" 3 hours ago Up 3 hours etcd

[cluster-admin@Docker-4 ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c24052f4398e quay.io/coreos/etcd:v2.3.1 "/etcd" 3 hours ago Up 3 hours etcd

Here is the ucp service status on the failed nodes -

[cluster-admin@Docker-4 ~]$ sudo systemctl status -l -n 100 ucp.service
● ucp.service - Ucp
Loaded: loaded (/etc/systemd/system/ucp.service; static; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2016-06-08 15:36:49 IST; 23min ago
Process: 23657 ExecStop=/usr/bin/ucp.sh stop (code=exited, status=0/SUCCESS)
Process: 21913 ExecStart=/usr/bin/ucp.sh start (code=exited, status=1/FAILURE)
Main PID: 21913 (code=exited, status=1/FAILURE)

Jun 08 15:36:35 Docker-4.cisco.com systemd[1]: Started Ucp.
Jun 08 15:36:35 Docker-4.cisco.com systemd[1]: Starting Ucp...
Jun 08 15:36:35 Docker-4.cisco.com ucp.sh[21913]: starting ucp on Docker-4-FLM19379F2F[10.65.122.67]
Jun 08 15:36:36 Docker-4.cisco.com ucp.sh[21913]: INFO[0000] Your engine version 1.11.1-cs2 is compatible
Jun 08 15:36:36 Docker-4.cisco.com ucp.sh[21913]: WARN[0000] Your system uses devicemapper. We can not accurately detect available storage space. Please make sure you have at least 3.00 GB available in /var/lib/docker
Jun 08 15:36:38 Docker-4.cisco.com ucp.sh[21913]: INFO[0002] All required images are present
Jun 08 15:36:39 Docker-4.cisco.com ucp.sh[21913]: INFO[0000] This engine will join UCP and advertise itself with host address 10.65.122.67 - If this is incorrect, please specify an alternative address with the '--host-address' flag
Jun 08 15:36:39 Docker-4.cisco.com ucp.sh[21913]: INFO[0000] Verifying your system is compatible with UCP
Jun 08 15:36:39 Docker-4.cisco.com ucp.sh[21913]: INFO[0000] Checking that required ports are available and accessible
Jun 08 15:36:48 Docker-4.cisco.com ucp.sh[21913]: WARN[0008] unable to login user on auth provider: API Errors: ["INTERNAL_ERROR: An internal server error occurred. Contact a system administrator for more information. - Detail: map[requestID:631f04be-9898-45dc-a332-d2896418c3ee]"]
Jun 08 15:36:48 Docker-4.cisco.com ucp.sh[21913]: FATA[0008] Unexpected error logging in to UCP: 500
Jun 08 15:36:49 Docker-4.cisco.com systemd[1]: ucp.service: main process exited, code=exited, status=1/FAILURE
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: docker: "stop" requires a minimum of 1 argument.
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: See 'docker stop --help'.
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: Usage: docker stop [OPTIONS] CONTAINER [CONTAINER...]
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: Stop a running container.
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: Sending SIGTERM and then SIGKILL after a grace period
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: docker: "rm" requires a minimum of 1 argument.
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: See 'docker rm --help'.
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: Usage: docker rm [OPTIONS] CONTAINER [CONTAINER...]
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: Remove one or more containers
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-auth-api-certs
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-auth-store-certs
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-auth-store-data
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-auth-worker-certs
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-auth-worker-data
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-client-root-ca
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-cluster-root-ca
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-controller-client-certs
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-controller-server-certs
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-kv
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-kv-certs
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-node-certs
Jun 08 15:36:49 Docker-4.cisco.com systemd[1]: Unit ucp.service entered failed state.
Jun 08 15:36:49 Docker-4.cisco.com systemd[1]: ucp.service failed.

[cluster-admin@Docker-3 ~]$ sudo systemctl status -l -n 100 ucp.service
● ucp.service - Ucp
Loaded: loaded (/etc/systemd/system/ucp.service; static; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2016-06-08 15:35:24 IST; 25min ago
Process: 23945 ExecStop=/usr/bin/ucp.sh stop (code=exited, status=0/SUCCESS)
Process: 22188 ExecStart=/usr/bin/ucp.sh start (code=exited, status=1/FAILURE)
Main PID: 22188 (code=exited, status=1/FAILURE)

Jun 08 15:35:10 Docker-3.cisco.com systemd[1]: Started Ucp.
Jun 08 15:35:10 Docker-3.cisco.com systemd[1]: Starting Ucp...
Jun 08 15:35:10 Docker-3.cisco.com ucp.sh[22188]: starting ucp on Docker-3-FCH19517CF9[10.65.122.66]
Jun 08 15:35:11 Docker-3.cisco.com ucp.sh[22188]: INFO[0000] Your engine version 1.11.1-cs2 is compatible
Jun 08 15:35:11 Docker-3.cisco.com ucp.sh[22188]: WARN[0000] Your system uses devicemapper. We can not accurately detect available storage space. Please make sure you have at least 3.00 GB available in /var/lib/docker
Jun 08 15:35:13 Docker-3.cisco.com ucp.sh[22188]: INFO[0002] All required images are present
Jun 08 15:35:14 Docker-3.cisco.com ucp.sh[22188]: INFO[0000] This engine will join UCP and advertise itself with host address 10.65.122.66 - If this is incorrect, please specify an alternative address with the '--host-address' flag
Jun 08 15:35:14 Docker-3.cisco.com ucp.sh[22188]: INFO[0000] Verifying your system is compatible with UCP
Jun 08 15:35:14 Docker-3.cisco.com ucp.sh[22188]: INFO[0000] Checking that required ports are available and accessible
Jun 08 15:35:23 Docker-3.cisco.com ucp.sh[22188]: WARN[0009] unable to login user on auth provider: API Errors: ["INTERNAL_ERROR: An internal server error occurred. Contact a system administrator for more information. - Detail: map[requestID:a8f333f0-4f60-468e-a8a9-f2390f20281e]"]
Jun 08 15:35:23 Docker-3.cisco.com ucp.sh[22188]: FATA[0009] Unexpected error logging in to UCP: 500
Jun 08 15:35:24 Docker-3.cisco.com systemd[1]: ucp.service: main process exited, code=exited, status=1/FAILURE
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: docker: "stop" requires a minimum of 1 argument.
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: See 'docker stop --help'.
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: Usage: docker stop [OPTIONS] CONTAINER [CONTAINER...]
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: Stop a running container.
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: Sending SIGTERM and then SIGKILL after a grace period
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: docker: "rm" requires a minimum of 1 argument.
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: See 'docker rm --help'.
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: Usage: docker rm [OPTIONS] CONTAINER [CONTAINER...]
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: Remove one or more containers
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-auth-api-certs
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-auth-store-certs
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-auth-store-data
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-auth-worker-certs
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-auth-worker-data
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-client-root-ca
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-cluster-root-ca
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-controller-client-certs
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-controller-server-certs
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-kv
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-kv-certs
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-node-certs
Jun 08 15:35:24 Docker-3.cisco.com systemd[1]: Unit ucp.service entered failed state.
Jun 08 15:35:24 Docker-3.cisco.com systemd[1]: ucp.service failed.

Restart service on these nodes did not helped.
UCP URL access is failed with same Internal API errors.
screen shot 2016-06-08 at 4 02 00 pm

Also UCP URL is accessible on host ip(10.65.122.65) as well, apart from VIP(10.65.122.79).

might be related to the recent certs change. ping @vvb

@rkharya
Was this error seen on first commission attempt itself or after UCP restart?

Node commissioning was successful in the sense cluster-manager does report any errors and doesn't cleanup the things. But UCP is running only on 1st node. Rest of the 2 nodes it was in failed state. If i remember correctly the error was different than that which came after re-start.

If i remember correctly the error was different than that which came after re-start.

ok, makes sense. Yeah, ucp restart may have issues (contiv/ansible#217)

Do you happen to have logs from first failure? The journalctl for clusterm should have those logs

this seems to be the first failure:

Jun 08 12:44:12 Docker-4.cisco.com ucp.sh[11317]: cannot enable tty mode on non tty input

I think removing the -i option in docker run in ucp script shall fix this but let's wait on @vvb in case he had a reason for having it (https://github.com/contiv/ansible/blob/master/roles/ucp/templates/ucp.j2#L75 and https://github.com/contiv/ansible/blob/master/roles/ucp/templates/ucp.j2#L96)

vvb commented

@mapuri
-i is required, without it, the docker run return code is 1 and so it fails. I checked with Uday, and he mentioned that -i is required.
cannot enable tty mode on non tty input could be related to -t

hmm, other commands with -t seem to be fine though

vvb commented

@rkharya @mapuri looked at the logs, the only difference I see between what I had tested and this is the docker version - I was using 1.11 and here we are using 1.11.1-cs2
Let me test with 1.11.1-cs2 and get back.

vvb commented

@rkharya this is fixed, so closing this now. please re-open if you notice otherwise.