mantl/mesos-consul

Connect to Zookeeper fails

tnolet opened this issue · 20 comments

Hi,

getting this error:

2015/10/27 11:29:31 Connected to mesos01.bva.nu:2181
2015/10/27 11:29:31 Authenticated: id=94756170276864056, timeout=40000
2015/10/27 11:31:31 [ERROR] Timed out waiting for initial ZK detection

Seems like it can connect and then later times out. Any ideas?

The connect seems to be working. The timeout is caused by the zookeeper detector not finding a Mesos master in Zk. Can you log in to zookeeper and run ls /mesos and see if there are any json.info_000000 or info_000000 nodes?

@ChrisAubuchon Sure, here you go. The info is right there. The mesos cluster itself is also performing fine, so no real Zookeeper issues on that side.

root@mesos01:~# /usr/share/zookeeper/bin/zkCli.sh 
Connecting to localhost:2181
Welcome to ZooKeeper!
JLine support is enabled

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[zk: localhost:2181(CONNECTED) 0] ls /mesos
[json.info_0000000025, json.info_0000000026, log_replicas]
[zk: localhost:2181(CONNECTED) 1] 

What version of mesos-consul are you running?

@ChrisAubuchon I'm running it from this Docker container: ciscocloud/mesos-consul:v0.2.1

Okay. That's the issue. v0.2.1 doesn't support Mesos 0.24. There is a new container ciscocloud/mesos-consul:v0.3 that supports Mesos 0.24.

@ChrisAubuchon I switched to v0.3. but get exactly the same. This is the log straight from the container:

root@mesos03:~# docker ps -a
CONTAINER ID        IMAGE                            COMMAND                  CREATED             STATUS                        PORTS               NAMES
be8acf83bbf8        ciscocloud/mesos-consul:v0.3     "/bin/mesos-consul --"   7 minutes ago       Exited (1) 5 minutes ago                          mesos-9d6e3ad9-5edc-42a4-bc2f-8035ece472a1-S0.268eb80b-1f51-43ea-b58d-985e99112237
root@mesos03:~# docker logs be8acf83bbf8
2015/10/27 14:23:10 Connected to mesos01.bva.nu:2181
2015/10/27 14:23:10 Authenticated: id=94756170276864062, timeout=40000
time="2015-10-27T14:25:10Z" level=fatal msg="Timed out waiting for initial ZK detection." 
root@mesos03:~# 

What is the zookeeper address that you are passing to the container? zk://mesos01.bva.nu:2181/mesos?

@ChrisAubuchon Yes, but append with another zookeeper in the quorum. This is the exact app definition I deploy to Marathon. Of course, the hosts are resolvable as the connection is initially made if I interpret the logs correctly.

{
  "args": [
    "--zk=zk://mesos01.bva.nu:2181/mesos,zk://mesos02.bva.nu:2181/mesos"
  ],
  "container": {
    "type": "DOCKER",
    "docker": {
      "network": "HOST",
      "image": "ciscocloud/mesos-consul:v0.3"
    }
  },
  "id": "mesos-consul",
  "instances": 1,
  "cpus": 0.1,
  "mem": 256
}

Ah! mesos-consul doesn't currently support a list of zookeeper servers. Try taking out the second zookeeper server. I think it's trying to read the /mesos,zk://mesos02.bva.nu:2181/mesos node inside of zk.

@ChrisAubuchon Aha! Ok, well that mean I can't use it at all in production. We need a Zookeeper quorum to keep the state of Mesos. Connecting to just one of the members for state kinda defeats the whole purpose of having Mesos, Zookeeper, Consul etc. What's your opinion?

@ChrisAubuchon It works with just one ZK instance. Lets mark this as solved for now. I'll see if can find some time to contribute a patch for this. Also multi-port services is a thing we need...

@tnolet I think that the zookeeper detector CAN use multiple zookeeper servers but in a weird format. Try --zk=zk://mesos01.bva.nu:2181,mesos02.bva.nu:2181/mesos

@ChrisAubuchon amazingly, that works. Although if one of the servers fails (I made a stupid typo) it doesn't try the other one...

@tnolet Is that at startup? I've tried an unreachable host, bad host name and a host not running zk and it is failing over correctly...

@ChrisAubuchon yes, this was at startup. But the error was that the /mesos key was not there. This was my error. However, this should of course still trigger a failover to the other ZK instance. Not a big issue though.

I'm having this very same error, but the fix above to the --zk argument did not fix my issue.

[root@master1 ~]# mesos-consul --zk=zk://192.168.10.145:2181/mesos --log-level=DEBUG --consul --refresh=5s
INFO[0000] Using zookeeper: zk://192.168.10.145:2181/mesos
DEBU[0000] Zookeeper address                             zk=zk://192.168.10.145:2181/mesos
2016/04/11 20:17:26 Connected to 192.168.10.145:2181
2016/04/11 20:17:26 Authenticated: id=95710752094158869, timeout=40000
FATA[0120] Timed out waiting for initial ZK detection.

[root@master1 ~]# telnet 192.168.10.145 2181
Trying 192.168.10.145...
Connected to 192.168.10.145.
Escape character is '^]'.

^]
telnet> close
Connection closed.
[root@master1 ~]#
[root@master1 ~]# mesos-consul --version
mesos-consul v0.3.1

[root@master1 ~]# telnet 192.168.10.145 2181
Trying 192.168.10.145...
Connected to 192.168.10.145.
Escape character is '^]'.
ruok
imokConnection closed by foreign host.
[root@master1 ~]# telnet 192.168.10.145 2181
Trying 192.168.10.145...
Connected to 192.168.10.145.
Escape character is '^]'.
status
Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT
Clients:
 /192.168.10.149:46038[1](queued=0,recved=369,sent=369)
 /192.168.10.148:56186[1](queued=0,recved=356,sent=356)
 /192.168.10.145:50194[1](queued=0,recved=222,sent=222)
 /192.168.10.146:53431[1](queued=0,recved=356,sent=356)
 /192.168.10.119:33931[1](queued=0,recved=356,sent=356)
 /192.168.10.148:56184[1](queued=0,recved=356,sent=356)
 /192.168.10.146:53429[1](queued=0,recved=356,sent=356)
 /192.168.10.146:53432[1](queued=0,recved=356,sent=356)
 /192.168.10.145:50038[1](queued=0,recved=2174,sent=2177)
 /192.168.10.145:50034[1](queued=0,recved=356,sent=356)
 /192.168.10.148:56185[1](queued=0,recved=356,sent=356)
 /192.168.10.145:50037[1](queued=0,recved=356,sent=356)
 /192.168.10.120:44014[1](queued=0,recved=356,sent=356)
 /192.168.10.147:37536[1](queued=0,recved=356,sent=356)
 /192.168.10.121:49779[1](queued=0,recved=356,sent=356)
 /192.168.10.146:53430[1](queued=0,recved=356,sent=356)
 /192.168.10.149:46032[1](queued=0,recved=422,sent=424)
 /192.168.10.147:37533[1](queued=0,recved=356,sent=356)
 /192.168.10.149:46034[1](queued=0,recved=356,sent=356)
 /192.168.10.145:50035[1](queued=0,recved=356,sent=356)
 /192.168.10.149:46036[1](queued=0,recved=356,sent=356)
 /192.168.10.145:50036[1](queued=0,recved=356,sent=356)
 /192.168.10.147:37531[1](queued=0,recved=711,sent=713)
 /192.168.10.149:46035[1](queued=0,recved=356,sent=356)
 /192.168.10.148:56187[1](queued=0,recved=356,sent=356)
 /192.168.10.118:34270[1](queued=0,recved=356,sent=356)
 /192.168.10.148:56182[1](queued=0,recved=709,sent=711)
 /192.168.10.115:42196[1](queued=0,recved=356,sent=356)
 /192.168.10.147:37535[1](queued=0,recved=356,sent=356)
 /192.168.10.147:37534[1](queued=0,recved=356,sent=356)
 /192.168.10.149:46039[1](queued=0,recved=357,sent=357)
 /192.168.10.145:50445[0](queued=0,recved=1,sent=0)
 /192.168.10.149:46037[1](queued=0,recved=356,sent=356)
 /192.168.10.145:50033[1](queued=0,recved=356,sent=356)
 /192.168.10.146:53433[1](queued=0,recved=711,sent=713)
 /192.168.10.117:45342[1](queued=0,recved=356,sent=356)

Latency min/avg/max: 0/6/3725
Received: 17391
Sent: 17400
Connections: 36
Outstanding: 0
Zxid: 0x200000031
Mode: follower
Node count: 30
Connection closed by foreign host.
[root@master1 ~]#

[root@master1 ~]# consul version
Consul v0.6.4
Consul Protocol: 3 (Understands back to: 1)

[root@master1 ~]# mesos-master --version
mesos 0.28.0

@cicide Can you log into zookeeper and run ls /mesos?

Looks like you spotted it.

/opt/mesosphere/zookeeper/bin/zkCli.sh
Connecting to localhost:2181
log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Welcome to ZooKeeper!
JLine support is enabled

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[zk: localhost:2181(CONNECTED) 0] ls /mesos
Node does not exist: /mesos

Strange, when setting up mesos and zookeeper, when is /mesos created? I would have expected to see some errors from mesos-master or mesos-slave.

@ChrisAubuchon

a bad mesos zk file was the problem. It caused an invalid name to be used in zookeeper for what should have been /mesos. Thanks for pointing me in the right direction

need to document in wiki