UCP services fails to start on replicas
rkharya opened this issue · 10 comments
Docker - 1.11.1-cs2
UCP - 1.1.0
Contiv_Cluster - v0.1-05-27-2016.08-26-41.UTC
3 node master setup Docker-2/3/4 got commissioned successfully. UCP master controller came up on Docker-2 successfully.
[cluster-admin@Docker-2 ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
00c78e666df3 skynetservices/skydns:latest "/skydns" 3 hours ago Up 3 hours 53/tcp, 53/udp defaultdns
84aaf2fbb747 docker/ucp-controller:1.1.0 "/bin/controller serv" 3 hours ago Up 3 hours 0.0.0.0:443->8080/tcp ucp-controller
240f2d9c055b docker/ucp-auth:1.1.0 "/usr/local/bin/enzi " 3 hours ago Up 3 hours 0.0.0.0:12386->4443/tcp ucp-auth-worker
950737ab3d03 docker/ucp-auth:1.1.0 "/usr/local/bin/enzi " 3 hours ago Up 3 hours 0.0.0.0:12385->4443/tcp ucp-auth-api
0a917277e78c docker/ucp-auth-store:1.1.0 "/usr/local/bin/rethi" 3 hours ago Up 3 hours 0.0.0.0:12383-12384->12383-12384/tcp ucp-auth-store
1c1b21800c06 docker/ucp-cfssl:1.1.0 "/bin/cfssl serve -ad" 3 hours ago Up 3 hours 8888/tcp, 0.0.0.0:12381->12381/tcp ucp-cluster-root-ca
057843d79edf docker/ucp-cfssl:1.1.0 "/bin/cfssl serve -ad" 3 hours ago Up 3 hours 8888/tcp, 0.0.0.0:12382->12382/tcp ucp-client-root-ca
6029ace03cad docker/ucp-swarm:1.1.0 "/swarm manage --tlsv" 3 hours ago Up 3 hours 0.0.0.0:2376->2375/tcp ucp-swarm-manager
986415f909ae docker/ucp-swarm:1.1.0 "/swarm join --discov" 3 hours ago Up 3 hours 2375/tcp ucp-swarm-join
14f6374efa0b docker/ucp-proxy:1.1.0 "/bin/run" 3 hours ago Up 3 hours 0.0.0.0:12376->2376/tcp ucp-proxy
58059b6fe7c0 docker/ucp-etcd:1.1.0 "/bin/etcd --data-dir" 3 hours ago Up 3 hours 2380/tcp, 4001/tcp, 7001/tcp, 0.0.0.0:12380->12380/tcp, 0.0.0.0:12379->2379/tcp ucp-kv
5bff22bddf90 quay.io/coreos/etcd:v2.3.1 "/etcd" 3 hours ago Up 3 hours etcd
On replica nodes -
[cluster-admin@Docker-3 ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
91afc59d3874 quay.io/coreos/etcd:v2.3.1 "/etcd" 3 hours ago Up 3 hours etcd
[cluster-admin@Docker-4 ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c24052f4398e quay.io/coreos/etcd:v2.3.1 "/etcd" 3 hours ago Up 3 hours etcd
Here is the ucp service status on the failed nodes -
[cluster-admin@Docker-4 ~]$ sudo systemctl status -l -n 100 ucp.service
● ucp.service - Ucp
Loaded: loaded (/etc/systemd/system/ucp.service; static; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2016-06-08 15:36:49 IST; 23min ago
Process: 23657 ExecStop=/usr/bin/ucp.sh stop (code=exited, status=0/SUCCESS)
Process: 21913 ExecStart=/usr/bin/ucp.sh start (code=exited, status=1/FAILURE)
Main PID: 21913 (code=exited, status=1/FAILURE)
Jun 08 15:36:35 Docker-4.cisco.com systemd[1]: Started Ucp.
Jun 08 15:36:35 Docker-4.cisco.com systemd[1]: Starting Ucp...
Jun 08 15:36:35 Docker-4.cisco.com ucp.sh[21913]: starting ucp on Docker-4-FLM19379F2F[10.65.122.67]
Jun 08 15:36:36 Docker-4.cisco.com ucp.sh[21913]: INFO[0000] Your engine version 1.11.1-cs2 is compatible
Jun 08 15:36:36 Docker-4.cisco.com ucp.sh[21913]: WARN[0000] Your system uses devicemapper. We can not accurately detect available storage space. Please make sure you have at least 3.00 GB available in /var/lib/docker
Jun 08 15:36:38 Docker-4.cisco.com ucp.sh[21913]: INFO[0002] All required images are present
Jun 08 15:36:39 Docker-4.cisco.com ucp.sh[21913]: INFO[0000] This engine will join UCP and advertise itself with host address 10.65.122.67 - If this is incorrect, please specify an alternative address with the '--host-address' flag
Jun 08 15:36:39 Docker-4.cisco.com ucp.sh[21913]: INFO[0000] Verifying your system is compatible with UCP
Jun 08 15:36:39 Docker-4.cisco.com ucp.sh[21913]: INFO[0000] Checking that required ports are available and accessible
Jun 08 15:36:48 Docker-4.cisco.com ucp.sh[21913]: WARN[0008] unable to login user on auth provider: API Errors: ["INTERNAL_ERROR: An internal server error occurred. Contact a system administrator for more information. - Detail: map[requestID:631f04be-9898-45dc-a332-d2896418c3ee]"]
Jun 08 15:36:48 Docker-4.cisco.com ucp.sh[21913]: FATA[0008] Unexpected error logging in to UCP: 500
Jun 08 15:36:49 Docker-4.cisco.com systemd[1]: ucp.service: main process exited, code=exited, status=1/FAILURE
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: docker: "stop" requires a minimum of 1 argument.
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: See 'docker stop --help'.
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: Usage: docker stop [OPTIONS] CONTAINER [CONTAINER...]
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: Stop a running container.
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: Sending SIGTERM and then SIGKILL after a grace period
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: docker: "rm" requires a minimum of 1 argument.
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: See 'docker rm --help'.
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: Usage: docker rm [OPTIONS] CONTAINER [CONTAINER...]
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: Remove one or more containers
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-auth-api-certs
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-auth-store-certs
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-auth-store-data
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-auth-worker-certs
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-auth-worker-data
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-client-root-ca
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-cluster-root-ca
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-controller-client-certs
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-controller-server-certs
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-kv
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-kv-certs
Jun 08 15:36:49 Docker-4.cisco.com ucp.sh[23657]: ucp-node-certs
Jun 08 15:36:49 Docker-4.cisco.com systemd[1]: Unit ucp.service entered failed state.
Jun 08 15:36:49 Docker-4.cisco.com systemd[1]: ucp.service failed.
[cluster-admin@Docker-3 ~]$ sudo systemctl status -l -n 100 ucp.service
● ucp.service - Ucp
Loaded: loaded (/etc/systemd/system/ucp.service; static; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2016-06-08 15:35:24 IST; 25min ago
Process: 23945 ExecStop=/usr/bin/ucp.sh stop (code=exited, status=0/SUCCESS)
Process: 22188 ExecStart=/usr/bin/ucp.sh start (code=exited, status=1/FAILURE)
Main PID: 22188 (code=exited, status=1/FAILURE)
Jun 08 15:35:10 Docker-3.cisco.com systemd[1]: Started Ucp.
Jun 08 15:35:10 Docker-3.cisco.com systemd[1]: Starting Ucp...
Jun 08 15:35:10 Docker-3.cisco.com ucp.sh[22188]: starting ucp on Docker-3-FCH19517CF9[10.65.122.66]
Jun 08 15:35:11 Docker-3.cisco.com ucp.sh[22188]: INFO[0000] Your engine version 1.11.1-cs2 is compatible
Jun 08 15:35:11 Docker-3.cisco.com ucp.sh[22188]: WARN[0000] Your system uses devicemapper. We can not accurately detect available storage space. Please make sure you have at least 3.00 GB available in /var/lib/docker
Jun 08 15:35:13 Docker-3.cisco.com ucp.sh[22188]: INFO[0002] All required images are present
Jun 08 15:35:14 Docker-3.cisco.com ucp.sh[22188]: INFO[0000] This engine will join UCP and advertise itself with host address 10.65.122.66 - If this is incorrect, please specify an alternative address with the '--host-address' flag
Jun 08 15:35:14 Docker-3.cisco.com ucp.sh[22188]: INFO[0000] Verifying your system is compatible with UCP
Jun 08 15:35:14 Docker-3.cisco.com ucp.sh[22188]: INFO[0000] Checking that required ports are available and accessible
Jun 08 15:35:23 Docker-3.cisco.com ucp.sh[22188]: WARN[0009] unable to login user on auth provider: API Errors: ["INTERNAL_ERROR: An internal server error occurred. Contact a system administrator for more information. - Detail: map[requestID:a8f333f0-4f60-468e-a8a9-f2390f20281e]"]
Jun 08 15:35:23 Docker-3.cisco.com ucp.sh[22188]: FATA[0009] Unexpected error logging in to UCP: 500
Jun 08 15:35:24 Docker-3.cisco.com systemd[1]: ucp.service: main process exited, code=exited, status=1/FAILURE
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: docker: "stop" requires a minimum of 1 argument.
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: See 'docker stop --help'.
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: Usage: docker stop [OPTIONS] CONTAINER [CONTAINER...]
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: Stop a running container.
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: Sending SIGTERM and then SIGKILL after a grace period
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: docker: "rm" requires a minimum of 1 argument.
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: See 'docker rm --help'.
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: Usage: docker rm [OPTIONS] CONTAINER [CONTAINER...]
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: Remove one or more containers
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-auth-api-certs
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-auth-store-certs
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-auth-store-data
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-auth-worker-certs
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-auth-worker-data
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-client-root-ca
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-cluster-root-ca
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-controller-client-certs
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-controller-server-certs
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-kv
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-kv-certs
Jun 08 15:35:24 Docker-3.cisco.com ucp.sh[23945]: ucp-node-certs
Jun 08 15:35:24 Docker-3.cisco.com systemd[1]: Unit ucp.service entered failed state.
Jun 08 15:35:24 Docker-3.cisco.com systemd[1]: ucp.service failed.
Restart service on these nodes did not helped.
UCP URL access is failed with same Internal API errors.
Also UCP URL is accessible on host ip(10.65.122.65) as well, apart from VIP(10.65.122.79).
Node commissioning was successful in the sense cluster-manager does report any errors and doesn't cleanup the things. But UCP is running only on 1st node. Rest of the 2 nodes it was in failed state. If i remember correctly the error was different than that which came after re-start.
If i remember correctly the error was different than that which came after re-start.
ok, makes sense. Yeah, ucp restart may have issues (contiv/ansible#217)
Do you happen to have logs from first failure? The journalctl for clusterm should have those logs
Here are those -
Docker3-journal-ucp.txt
Docker4-journal-ucp.txt
journalctl-clusterm.txt
this seems to be the first failure:
Jun 08 12:44:12 Docker-4.cisco.com ucp.sh[11317]: cannot enable tty mode on non tty input
I think removing the -i
option in docker run in ucp script shall fix this but let's wait on @vvb in case he had a reason for having it (https://github.com/contiv/ansible/blob/master/roles/ucp/templates/ucp.j2#L75 and https://github.com/contiv/ansible/blob/master/roles/ucp/templates/ucp.j2#L96)
@mapuri
-i
is required, without it, the docker run
return code is 1
and so it fails. I checked with Uday, and he mentioned that -i
is required.
cannot enable tty mode on non tty input
could be related to -t
hmm, other commands with -t
seem to be fine though