kudobuilder/operators

ZooKeeper fails because of a catch-22: DNS lookups against pods that are not yet running.

dick-twocows opened this issue · 1 comment

The bootstrap script creates /conf/zoo.cfg with server.n entries, but the pods are not yet running, so Kubernetes never creates Endpoints for them, the DNS names never resolve, and the pods get stuck in CrashLoopBackOff.

Because ZooKeeper can never resolve the other servers at startup, it crash-loops.

It is my understanding that Services/Endpoints are only populated once the Pods are running (more precisely, once they pass their readiness probe).

server.1=zookeeper-instance-zookeeper-0.zookeeper-instance-hs.default.svc.cluster.local:2888:3888
server.2=zookeeper-instance-zookeeper-1.zookeeper-instance-hs.default.svc.cluster.local:2888:3888
server.3=zookeeper-instance-zookeeper-2.zookeeper-instance-hs.default.svc.cluster.local:2888:3888
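
For reference, Kubernetes only publishes A records behind a headless Service for Pods that pass their readiness probe, unless the Service opts into unready addresses via spec.publishNotReadyAddresses (a real core/v1 Service field). A quick way to inspect that field and, as a manual workaround only (not necessarily the right operator-level fix), flip it by hand:

❯ k get service zookeeper-instance-hs -o jsonpath='{.spec.publishNotReadyAddresses}{"\n"}'
❯ k patch service zookeeper-instance-hs --type merge -p '{"spec":{"publishNotReadyAddresses":true}}'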

ZooKeeper pods are stuck

❯ k get pods
NAME                                READY   STATUS             RESTARTS   AGE
nginx-deployment-77cfbb848f-8qdh8   1/1     Running            3          27h
nginx-deployment-77cfbb848f-t7gtj   1/1     Running            3          27h
zookeeper-instance-zookeeper-0      0/1     CrashLoopBackOff   6          14m
zookeeper-instance-zookeeper-2      0/1     CrashLoopBackOff   6          14m
zookeeper-instance-zookeeper-1      0/1     CrashLoopBackOff   6          14m

Logs for pod
❯ k logs zookeeper-instance-zookeeper-0 | more
Zookeeper configuration...

clientPort=2181
dataDir=/var/lib/zookeeper/data
dataLogDir=/logs
tickTime=2000
initLimit=10
syncLimit=5
maxClientCnxns=60
minSessionTimeout=4000
maxSessionTimeout=40000
autopurge.snapRetainCount=3
autopurge.purgeInteval=12

server.1=zookeeper-instance-zookeeper-0.zookeeper-instance-hs.default.svc.cluster.local:2888:3888
server.2=zookeeper-instance-zookeeper-1.zookeeper-instance-hs.default.svc.cluster.local:2888:3888
server.3=zookeeper-instance-zookeeper-2.zookeeper-instance-hs.default.svc.cluster.local:2888:3888
Creating ZooKeeper log4j configuration
ZooKeeper JMX enabled by default
Using config: /conf/zoo.cfg
2020-11-19 15:46:30,107 [myid:] - INFO [main:QuorumPeerConfig@133] - Reading configuration from: /conf/zoo.cfg
2020-11-19 15:46:30,112 [myid:] - INFO [main:QuorumPeerConfig@385] - clientPortAddress is 0.0.0.0/0.0.0.0:2181
2020-11-19 15:46:30,112 [myid:] - INFO [main:QuorumPeerConfig@389] - secureClientPort is not set
2020-11-19 15:46:30,125 [myid:1] - INFO [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3
2020-11-19 15:46:30,126 [myid:1] - INFO [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 0
2020-11-19 15:46:30,126 [myid:1] - INFO [main:DatadirCleanupManager@101] - Purge task is not scheduled.
2020-11-19 15:46:30,127 [myid:1] - INFO [main:ManagedUtil@46] - Log4j found with jmx enabled.
2020-11-19 15:46:30,135 [myid:1] - INFO [main:QuorumPeerMain@141] - Starting quorum peer
2020-11-19 15:46:30,142 [myid:1] - INFO [main:ServerCnxnFactory@135] - Using org.apache.zookeeper.server.NIOServerCnxnFactory as server connection factory
2020-11-19 15:46:30,143 [myid:1] - INFO [main:NIOServerCnxnFactory@673] - Configuring NIO connection handler with 10s sessionless connection timeout, 1 selector thread(s), 2 worker threads, and 64 kB direct buffers.
2020-11-19 15:46:30,182 [myid:1] - INFO [main:NIOServerCnxnFactory@686] - binding to port 0.0.0.0/0.0.0.0:2181
2020-11-19 15:46:30,207 [myid:1] - INFO [main:Log@193] - Logging initialized @465ms to org.eclipse.jetty.util.log.Slf4jLog
2020-11-19 15:46:30,387 [myid:1] - WARN [main:ContextHandler@1588] - o.e.j.s.ServletContextHandler@29ca901e{/,null,UNAVAILABLE} contextPath ends with /*
2020-11-19 15:46:30,387 [myid:1] - WARN [main:ContextHandler@1599] - Empty contextPath
2020-11-19 15:46:30,398 [myid:1] - INFO [main:X509Util@79] - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
2020-11-19 15:46:30,401 [myid:1] - INFO [main:FileTxnSnapLog@103] - zookeeper.snapshot.trust.empty : false
2020-11-19 15:46:30,407 [myid:1] - INFO [main:QuorumPeer@1488] - Local sessions disabled
2020-11-19 15:46:30,407 [myid:1] - INFO [main:QuorumPeer@1499] - Local session upgrading disabled
2020-11-19 15:46:30,407 [myid:1] - INFO [main:QuorumPeer@1466] - tickTime set to 2000
2020-11-19 15:46:30,407 [myid:1] - INFO [main:QuorumPeer@1510] - minSessionTimeout set to 4000
2020-11-19 15:46:30,408 [myid:1] - INFO [main:QuorumPeer@1521] - maxSessionTimeout set to 40000
2020-11-19 15:46:30,408 [myid:1] - INFO [main:QuorumPeer@1536] - initLimit set to 10
2020-11-19 15:46:30,418 [myid:1] - INFO [main:ZKDatabase@117] - zookeeper.snapshotSizeFactor = 0.33
2020-11-19 15:46:30,419 [myid:1] - INFO [main:QuorumPeer@1781] - Using insecure (non-TLS) quorum communication
2020-11-19 15:46:30,419 [myid:1] - INFO [main:QuorumPeer@1787] - Port unification disabled
2020-11-19 15:46:30,420 [myid:1] - INFO [main:QuorumPeer@2154] - QuorumPeer communication is not secured! (SASL auth disabled)
2020-11-19 15:46:30,420 [myid:1] - INFO [main:QuorumPeer@2183] - quorum.cnxn.threads.size set to 20
2020-11-19 15:46:30,421 [myid:1] - INFO [main:FileSnap@83] - Reading snapshot /var/lib/zookeeper/data/version-2/snapshot.0
2020-11-19 15:46:30,481 [myid:1] - INFO [main:Server@370] - jetty-9.4.17.v20190418; built: 2019-04-18T19:45:35.259Z; git: aa1c656c315c011c01e7b21aabb04066635b9f67; jvm 1.8.0_242-b08
2020-11-19 15:46:30,522 [myid:1] - INFO [main:DefaultSessionIdManager@365] - DefaultSessionIdManager workerName=node0
2020-11-19 15:46:30,523 [myid:1] - INFO [main:DefaultSessionIdManager@370] - No SessionScavenger set, using defaults
2020-11-19 15:46:30,525 [myid:1] - INFO [main:HouseKeeper@149] - node0 Scavenging every 660000ms
2020-11-19 15:46:30,535 [myid:1] - INFO [main:ContextHandler@855] - Started o.e.j.s.ServletContextHandler@29ca901e{/,null,AVAILABLE}
2020-11-19 15:46:30,582 [myid:1] - INFO [main:AbstractConnector@292] - Started ServerConnector@58651fd0{HTTP/1.1,[http/1.1]}{0.0.0.0:8080}
2020-11-19 15:46:30,582 [myid:1] - INFO [main:Server@410] - Started @846ms
2020-11-19 15:46:30,583 [myid:1] - INFO [main:JettyAdminServer@112] - Started AdminServer on address 0.0.0.0, port 8080 and command URL /commands
2020-11-19 15:46:30,586 [myid:1] - INFO [main:QuorumCnxManager$Listener@861] - Election port bind maximum retries is 3
2020-11-19 15:46:30,587 [myid:1] - INFO [QuorumPeerListener:QuorumCnxManager$Listener@911] - My election bind port: zookeeper-instance-zookeeper-0.zookeeper-instance-hs.default.svc.cluster.local/10.1.190.40:3888
2020-11-19 15:46:30,595 [myid:1] - INFO [QuorumPeermyid=1(secure=disabled):QuorumPeer@1193] - LOOKING
2020-11-19 15:46:30,596 [myid:1] - INFO [QuorumPeermyid=1(secure=disabled):FastLeaderElection@885] - New election. My id = 1, proposed zxid=0x0
2020-11-19 15:46:30,598 [myid:1] - WARN [WorkerSender[myid=1]:QuorumPeer$QuorumServer@196] - Failed to resolve address: zookeeper-instance-zookeeper-1.zookeeper-instance-hs.default.svc.cluster.local
java.net.UnknownHostException: zookeeper-instance-zookeeper-1.zookeeper-instance-hs.default.svc.cluster.local
    at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
    at java.net.InetAddress.getAllByName(InetAddress.java:1193)
    at java.net.InetAddress.getAllByName(InetAddress.java:1127)
    at java.net.InetAddress.getByName(InetAddress.java:1077)
    at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)
    at org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:764)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)
    at java.lang.Thread.run(Thread.java:748)
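
The resolution failure is reproducible from any pod in the cluster; for example (assuming the busybox:1.28 image is pullable, as in the upstream Kubernetes DNS debugging docs):

❯ k run -it --rm dns-check --image=busybox:1.28 --restart=Never -- nslookup zookeeper-instance-zookeeper-1.zookeeper-instance-hs.default.svc.cluster.local
# Fails with "can't resolve" while no pod behind the headless service is Ready,
# matching the UnknownHostException above.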

Describe service
❯ k describe services zookeeper-instance-hs
Name:              zookeeper-instance-hs
Namespace:         default
Labels:            app=zookeeper
                   heritage=kudo
                   kudo.dev/instance=zookeeper-instance
                   kudo.dev/operator=zookeeper
                   zookeeper=zookeeper-instance
Annotations:       kudo.dev/last-applied-configuration:
                     {"kind":"Service","apiVersion":"v1","metadata":{"name":"zookeeper-instance-hs","namespace":"default","creationTimestamp":null,"labels":{"a...
                   kudo.dev/last-plan-execution-uid: d585c0bb-cf9f-403d-91db-b08ea5f334e9
                   kudo.dev/phase: zookeeper
                   kudo.dev/plan: deploy
                   kudo.dev/step: deploy
Selector:          app=zookeeper,instance=zookeeper-instance
Type:              ClusterIP
IP:                None
Port:              server  2888/TCP
TargetPort:        2888/TCP
Endpoints:
Port:              leader-election  3888/TCP
TargetPort:        3888/TCP
Endpoints:
Session Affinity:  None
Events:
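
The empty Endpoints fields above can be confirmed directly:

❯ k get endpoints zookeeper-instance-hs
# ENDPOINTS stays <none> until at least one pod passes its readiness probe,
# so no A records are published and quorum DNS lookups fail.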

The above issue occurs with ZooKeeper 3.5.6, 3.6.1, and 3.6.2, and the solution is to add the 4lw.commands.whitelist line to the zoo.cfg heredoc in bootstrap.sh.yml (the two autopurge lines already appear in the script, per the config dump above, and are shown here for context):

autopurge.snapRetainCount=$SNAP_RETAIN_COUNT
autopurge.purgeInteval=$PURGE_INTERVAL
4lw.commands.whitelist=*

EOF
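
For context, a sketch (paraphrased, not the verbatim operator script) of where the line lands in the heredoc that bootstrap.sh uses to write the config; the autopurge keys are copied verbatim from the snippet above, including the purgeInteval spelling:

cat >/conf/zoo.cfg <<EOF
...
autopurge.snapRetainCount=$SNAP_RETAIN_COUNT
autopurge.purgeInteval=$PURGE_INTERVAL
4lw.commands.whitelist=*
EOF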

With this parameter added the cluster will form. I'm still not sure why this error manifests as a DNS error, but I can confirm that this fixes it on local, AWS, and OCI MicroK8s setups.
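
A plausible explanation, offered as an assumption rather than a confirmed root cause: ZooKeeper 3.5+ refuses four-letter-word commands that are not whitelisted, so if the operator's readiness probe is the usual "echo ruok | nc" check, it never receives imok, the pods never become Ready, the headless Service never publishes their DNS records, and each peer then fails to resolve the others. A quick way to test that theory (assuming nc is present in the image, as the stock readiness scripts expect):

❯ k exec zookeeper-instance-zookeeper-0 -- sh -c 'echo ruok | nc 127.0.0.1 2181'
# Prints "imok" once 4lw.commands.whitelist includes ruok (or *);
# prints a whitelist refusal message otherwise.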