elasticsearch-discovery hostname not found (related to closed issue #35)
mtaylor769 opened this issue · 24 comments
Hi. Working on 6.0.0 with @morphers82. I've been attempting many fixes (read #35 a dozen times and focused on kube-dns) and getting the following error only on es-client nodes:
[2017-11-26T21:47:04,470][WARN ][o.e.d.z.ZenDiscovery     ] [es-client-79f5fc4588-6p7v2] not enough master nodes discovered during pinging (found [[]], but needed [1]), pinging again
[2017-11-26T21:47:04,471][WARN ][o.e.d.z.UnicastZenPing   ] [es-client-79f5fc4588-6p7v2] failed to resolve host [elasticsearch-discovery]
java.net.UnknownHostException: elasticsearch-discovery
    at java.net.InetAddress.getAllByName0(InetAddress.java:1280) ~[?:1.8.0_131]
    at java.net.InetAddress.getAllByName(InetAddress.java:1192) ~[?:1.8.0_131]
    at java.net.InetAddress.getAllByName(InetAddress.java:1126) ~[?:1.8.0_131]
    at org.elasticsearch.transport.TcpTransport.parse(TcpTransport.java:882) ~[elasticsearch-6.0.0.jar:6.0.0]
    at org.elasticsearch.transport.TcpTransport.addressesFromString(TcpTransport.java:837) ~[elasticsearch-6.0.0.jar:6.0.0]
    at org.elasticsearch.transport.TransportService.addressesFromString(TransportService.java:706) ~[elasticsearch-6.0.0.jar:6.0.0]
    at org.elasticsearch.discovery.zen.UnicastZenPing.lambda$null$0(UnicastZenPing.java:213) ~[elasticsearch-6.0.0.jar:6.0.0]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_131]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-6.0.0.jar:6.0.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
I understand "this is not elasticsearch or kubernetes" as mentioned in #35 , but it was working before upgrading to 6.0.0. Is there an alternative to kube-dns that may work?
Correction: es-client pods, not nodes (sorry)
Please share the results of the following:
kubectl describe service elasticsearch-discovery
And:
kubectl -n kube-system describe deployment kube-dns
Hi! Thanks for responding so quickly. Here's the output from the elasticsearch-discovery service (note: I edited the live config to use type: NodePort and set externalIPs: to the subnet that connects to the router; I can set that back to ClusterIP if needed):
keyinsp@ubuntu:~$ kubectl describe --namespace=keyinsp-es-cluster svc/elasticsearch-discovery
Name: elasticsearch-discovery
Namespace: keyinsp-es-cluster
Labels: component=elasticsearch
role=master
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"component":"elasticsearch","role":"master"},"name":"elasticsearch-discovery...
Selector: component=elasticsearch,role=master
Type: NodePort
IP: 10.106.67.252
External IPs: 192.168.1.17
Port: transport 9300/TCP
TargetPort: 9300/TCP
NodePort: transport 32715/TCP
Endpoints: <none>
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>
Here is the kube-dns 'describe' output:
keyinsp@ubuntu:~$ kubectl -n kube-system describe deployment kube-dns
Name: kube-dns
Namespace: kube-system
CreationTimestamp: Fri, 24 Nov 2017 04:49:05 -0600
Labels: k8s-app=kube-dns
Annotations: deployment.kubernetes.io/revision=1
Selector: k8s-app=kube-dns
Replicas: 2 desired | 2 updated | 2 total | 0 available | 2 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 0 max unavailable, 10% max surge
Pod Template:
Labels: k8s-app=kube-dns
Service Account: kube-dns
Containers:
kubedns:
Image: gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.5
Ports: 10053/UDP, 10053/TCP, 10055/TCP
Args:
--domain=cluster.local.
--dns-port=10053
--config-dir=/kube-dns-config
--v=2
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:10054/healthcheck/kubedns delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8081/readiness delay=3s timeout=5s period=10s #success=1 #failure=3
Environment:
PROMETHEUS_PORT: 10055
Mounts:
/kube-dns-config from kube-dns-config (rw)
dnsmasq:
Image: gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.5
Ports: 53/UDP, 53/TCP
Args:
-v=2
-logtostderr
-configDir=/etc/k8s/dns/dnsmasq-nanny
-restartDnsmasq=true
--
-k
--cache-size=1000
--log-facility=-
--server=/cluster.local/127.0.0.1#10053
--server=/in-addr.arpa/127.0.0.1#10053
--server=/ip6.arpa/127.0.0.1#10053
Requests:
cpu: 150m
memory: 20Mi
Liveness: http-get http://:10054/healthcheck/dnsmasq delay=60s timeout=5s period=10s #success=1 #failure=5
Environment: <none>
Mounts:
/etc/k8s/dns/dnsmasq-nanny from kube-dns-config (rw)
sidecar:
Image: gcr.io/google_containers/k8s-dns-sidecar-amd64:1.14.5
Port: 10054/TCP
Args:
--v=2
--logtostderr
--probe=kubedns,127.0.0.1:10053,kubernetes.default.svc.cluster.local,5,A
--probe=dnsmasq,127.0.0.1:53,kubernetes.default.svc.cluster.local,5,A
Requests:
cpu: 10m
memory: 20Mi
Liveness: http-get http://:10054/metrics delay=60s timeout=5s period=10s #success=1 #failure=5
Environment: <none>
Mounts: <none>
Volumes:
kube-dns-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kube-dns
Optional: true
Conditions:
Type Status Reason
---- ------ ------
Available False MinimumReplicasUnavailable
OldReplicaSets: <none>
NewReplicaSet: kube-dns-545bc4bfd4 (2/2 replicas created)
Events: <none>
Endpoints: <none>
So no backends running.
On which? I need a more verbose answer; I have been troubleshooting this for four days now.
$ kubectl describe --namespace=keyinsp-es-cluster svc/elasticsearch-discovery
This shows no endpoints, which means no pods are being selected by the service; that is your issue.
Okay, so why aren't the es-client pods being picked? This is using your script that I had working fine a week ago.
elasticsearch-discovery is used only to expose Elasticsearch masters!

Selector: component=elasticsearch,role=master

The logs above seem to show one instance of es-client trying to find es-master, but none is found:

found [[]], but needed [1]

So, there are no masters running & ready.
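To confirm, something like this should show whether any ready master pods are actually backing the service (namespace taken from your output above):

```sh
# Pods matching the service selector - are they 1/1 Ready?
kubectl -n keyinsp-es-cluster get pods -l component=elasticsearch,role=master -o wide

# Endpoints for the discovery service - <none> means no ready pod matched the selector
kubectl -n keyinsp-es-cluster get endpoints elasticsearch-discovery
```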
No, that is not the case; there are three es-masters up and running:
kubectl get pods -n keyinsp-es-cluster
NAME READY STATUS RESTARTS AGE
es-client-674b4c8596-vhpb9 0/1 Running 0 1m
es-client-674b4c8596-zbrw8 0/1 Running 0 1m
es-data-588b66c986-74xc9 0/1 Pending 0 1m
es-data-588b66c986-hz8tr 0/1 Pending 0 1m
es-master-86bb4545f5-dcpcs 1/1 Running 2 8h
es-master-86bb4545f5-pt88f 1/1 Running 0 8h
es-master-86bb4545f5-tgwq9 1/1 Running 2 8h
kubectl -n keyinsp-es-cluster describe pod es-master-86bb4545f5-tgwq9
kubectl -n keyinsp-es-cluster logs es-master-86bb4545f5-dcpcs
Hi, Paulo. Sorry for the delay, and thanks again for helping me through this - I know it's probably something trivial. Here is the output; I re-deployed the containers so the hashes don't match, but these are the current pods:
keyinsp@ubuntu:~/kes$ kubectl -n keyinsp-es-cluster describe po/es-master-86bb4545f5-dqfq7
Name: es-master-86bb4545f5-dqfq7
Namespace: keyinsp-es-cluster
Node: obsidian/192.168.1.194
Start Time: Tue, 28 Nov 2017 10:22:15 -0600
Labels: component=elasticsearch
pod-template-hash=4266010191
role=master
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"keyinsp-es-cluster","name":"es-master-86bb4545f5","uid":"4bcab8dc-d458-11e7-bfa9-...
Status: Running
IP: 10.244.4.30
Created By: ReplicaSet/es-master-86bb4545f5
Controlled By: ReplicaSet/es-master-86bb4545f5
Init Containers:
init-sysctl:
Container ID: docker://d7af49e67e16d52edf50e3db589a8c23bf2f30331ce03c5b26e4dd2eaa5387df
Image: busybox
Image ID: docker-pullable://busybox@sha256:bbc3a03235220b170ba48a157dd097dd1379299370e1ed99ce976df0355d24f0
Port: <none>
Command:
sysctl
-w
vm.max_map_count=262144
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 28 Nov 2017 10:22:17 -0600
Finished: Tue, 28 Nov 2017 10:22:17 -0600
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xkfqf (ro)
Containers:
es-master:
Container ID: docker://77360256f337fa6b8def61e33fe60567cf1a15413531e90edc76d12c26171a6f
Image: quay.io/pires/docker-elasticsearch-kubernetes:6.0.0
Image ID: docker-pullable://quay.io/pires/docker-elasticsearch-kubernetes@sha256:62d1dbf7b7c0a47a560b97a53753c980be62c719db55c3fc9128d7a9315daa1e
Port: 9300/TCP
State: Running
Started: Tue, 28 Nov 2017 10:23:51 -0600
Last State: Terminated
Reason: Error
Exit Code: 143
Started: Tue, 28 Nov 2017 10:23:22 -0600
Finished: Tue, 28 Nov 2017 10:23:39 -0600
Ready: True
Restart Count: 3
Limits:
memory: 8Gi
Requests:
memory: 8Gi
Liveness: tcp-socket :9300 delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
NAMESPACE: keyinsp-es-cluster (v1:metadata.namespace)
NODE_NAME: es-master-86bb4545f5-dqfq7 (v1:metadata.name)
CLUSTER_NAME: keyinsp-es-cluster
NODE_MASTER: false
NODE_DATA: false
HTTP_ENABLE: true
ES_JAVA_OPTS: -Xms8g -Xmx8g
Mounts:
/data/elasticsearch/hot/1 from nfs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xkfqf (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
nfs:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
default-token-xkfqf:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-xkfqf
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/hostname=obsidian
Tolerations: node.alpha.kubernetes.io/notReady:NoExecute for 300s
node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 47m default-scheduler Successfully assigned es-master-86bb4545f5-dqfq7 to obsidian
Normal SuccessfulMountVolume 47m kubelet, obsidian MountVolume.SetUp succeeded for volume "nfs"
Normal SuccessfulMountVolume 47m kubelet, obsidian MountVolume.SetUp succeeded for volume "default-token-xkfqf"
Normal Pulled 47m kubelet, obsidian Container image "busybox" already present on machine
Normal Created 47m kubelet, obsidian Created container
Normal Started 47m kubelet, obsidian Started container
Normal Pulling 46m (x3 over 47m) kubelet, obsidian pulling image "quay.io/pires/docker-elasticsearch-kubernetes:6.0.0"
Normal Killing 46m (x2 over 47m) kubelet, obsidian Killing container with id docker://es-master:Container failed liveness probe.. Container will be killed and recreated.
Normal Pulled 46m (x3 over 47m) kubelet, obsidian Successfully pulled image "quay.io/pires/docker-elasticsearch-kubernetes:6.0.0"
Normal Created 46m (x3 over 47m) kubelet, obsidian Created container
Normal Started 46m (x3 over 47m) kubelet, obsidian Started container
Warning Unhealthy 46m (x6 over 47m) kubelet, obsidian Liveness probe failed: dial tcp 10.244.4.30:9300: getsockopt: connection refused
And the logs (up to the first error):
[2017-11-28T16:23:55,147][INFO ][o.e.n.Node ] [es-master-86bb4545f5-dqfq7] initializing ...
[2017-11-28T16:23:55,291][INFO ][o.e.e.NodeEnvironment ] [es-master-86bb4545f5-dqfq7] using [1] data paths, mounts [[/data (/dev/mapper/obsidian--vg-root)]], net usable_space [398.1gb], net total_space [426.1gb], types [ext4]
[2017-11-28T16:23:55,292][INFO ][o.e.e.NodeEnvironment ] [es-master-86bb4545f5-dqfq7] heap size [7.9gb], compressed ordinary object pointers [true]
[2017-11-28T16:23:55,293][INFO ][o.e.n.Node ] [es-master-86bb4545f5-dqfq7] node name [es-master-86bb4545f5-dqfq7], node ID [O0-ah9G5QruoR03wcJGtQA]
[2017-11-28T16:23:55,293][INFO ][o.e.n.Node ] [es-master-86bb4545f5-dqfq7] version[6.0.0], pid[1], build[8f0685b/2017-11-10T18:41:22.859Z], OS[Linux/4.4.0-101-generic/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_131/25.131-b11]
[2017-11-28T16:23:55,293][INFO ][o.e.n.Node ] [es-master-86bb4545f5-dqfq7] JVM arguments [-XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+DisableExplicitGC, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Xms8g, -Xmx8g, -Des.path.home=/elasticsearch, -Des.path.conf=/elasticsearch/config]
[2017-11-28T16:23:58,590][INFO ][o.e.p.PluginsService ] [es-master-86bb4545f5-dqfq7] loaded module [aggs-matrix-stats]
[2017-11-28T16:23:58,590][INFO ][o.e.p.PluginsService ] [es-master-86bb4545f5-dqfq7] loaded module [analysis-common]
[2017-11-28T16:23:58,590][INFO ][o.e.p.PluginsService ] [es-master-86bb4545f5-dqfq7] loaded module [ingest-common]
[2017-11-28T16:23:58,590][INFO ][o.e.p.PluginsService ] [es-master-86bb4545f5-dqfq7] loaded module [lang-expression]
[2017-11-28T16:23:58,590][INFO ][o.e.p.PluginsService ] [es-master-86bb4545f5-dqfq7] loaded module [lang-mustache]
[2017-11-28T16:23:58,590][INFO ][o.e.p.PluginsService ] [es-master-86bb4545f5-dqfq7] loaded module [lang-painless]
[2017-11-28T16:23:58,590][INFO ][o.e.p.PluginsService ] [es-master-86bb4545f5-dqfq7] loaded module [parent-join]
[2017-11-28T16:23:58,590][INFO ][o.e.p.PluginsService ] [es-master-86bb4545f5-dqfq7] loaded module [percolator]
[2017-11-28T16:23:58,590][INFO ][o.e.p.PluginsService ] [es-master-86bb4545f5-dqfq7] loaded module [reindex]
[2017-11-28T16:23:58,590][INFO ][o.e.p.PluginsService ] [es-master-86bb4545f5-dqfq7] loaded module [repository-url]
[2017-11-28T16:23:58,590][INFO ][o.e.p.PluginsService ] [es-master-86bb4545f5-dqfq7] loaded module [transport-netty4]
[2017-11-28T16:23:58,591][INFO ][o.e.p.PluginsService ] [es-master-86bb4545f5-dqfq7] loaded module [tribe]
[2017-11-28T16:23:58,591][INFO ][o.e.p.PluginsService ] [es-master-86bb4545f5-dqfq7] no plugins loaded
[2017-11-28T16:24:01,503][INFO ][o.e.d.DiscoveryModule ] [es-master-86bb4545f5-dqfq7] using discovery type [zen]
[2017-11-28T16:24:02,213][INFO ][o.e.n.Node ] [es-master-86bb4545f5-dqfq7] initialized
[2017-11-28T16:24:02,213][INFO ][o.e.n.Node ] [es-master-86bb4545f5-dqfq7] starting ...
[2017-11-28T16:24:02,872][INFO ][o.e.t.TransportService ] [es-master-86bb4545f5-dqfq7] publish_address {10.244.4.30:9300}, bound_addresses {10.244.4.30:9300}
[2017-11-28T16:24:02,886][INFO ][o.e.b.BootstrapChecks ] [es-master-86bb4545f5-dqfq7] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2017-11-28T16:24:03,235][INFO ][o.e.m.j.JvmGcMonitorService] [es-master-86bb4545f5-dqfq7] [gc][1] overhead, spent [420ms] collecting in the last [1s]
[2017-11-28T16:24:07,904][WARN ][o.e.d.z.UnicastZenPing ] [es-master-86bb4545f5-dqfq7] timed out after [5s] resolving host [elasticsearch-discovery]
[2017-11-28T16:24:10,908][WARN ][o.e.d.z.ZenDiscovery ] [es-master-86bb4545f5-dqfq7] not enough master nodes discovered during pinging (found [[]], but needed [1]), pinging again
[2017-11-28T16:24:10,908][WARN ][o.e.d.z.UnicastZenPing ] [es-master-86bb4545f5-dqfq7] failed to resolve host [elasticsearch-discovery]
java.net.UnknownHostException: elasticsearch-discovery
at java.net.InetAddress.getAllByName0(InetAddress.java:1280) ~[?:1.8.0_131]
at java.net.InetAddress.getAllByName(InetAddress.java:1192) ~[?:1.8.0_131]
at java.net.InetAddress.getAllByName(InetAddress.java:1126) ~[?:1.8.0_131]
One thing suggested in my research was to set 'server.hostname=0.0.0.0' (bind to all interfaces) in elasticsearch.yml and kibana.yml (Kibana is something I was also having trouble connecting to when this was running previously).
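(Side note on the setting names, in case it matters: as far as I know the Elasticsearch option is network.host and the Kibana one is server.host; a minimal sketch, assuming the files are edited directly rather than through the image's environment variables:)

```yaml
# elasticsearch.yml (sketch) - bind to all interfaces
network.host: 0.0.0.0

# kibana.yml (sketch)
server.host: "0.0.0.0"
```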
Interesting. This may be something with Elasticsearch 6.0.0. I will investigate while I update the docs in github.com/pires/kubernetes-elasticsearch-cluster.
Great. Just as an extra piece of info: I tried rolling back to 5.6.3 and I'm seeing the same errors from the es-master pods, so I don't think it's specific to 6.0.0; something else is missing in my setup... I've been looking at the kube-dns component to see if I missed some serviceaccount or something, but again, I'd been stabbing at this for several days before reaching out to you. I'll look for your updates.
Yes, it could be an issue with kube-dns, indeed. Try the following:
kubectl run -i --tty dns-debug --image=busybox --restart=Never -- sh
Then, inside the pod container:
nslookup kubernetes.default
nslookup google.com
Paste the results here.
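(For reference, with a healthy kube-dns the first lookup should succeed and typically resolves to the API server's ClusterIP; something like the sketch below, where the exact addresses depend on your service CIDR:)

```
/ # nslookup kubernetes.default
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
```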
here we go:
/ # nslookup kubernetes.default
Server: 10.96.0.10
Address 1: 10.96.0.10
nslookup: can't resolve 'kubernetes.default'
and
/ # nslookup google.com
Server: 10.96.0.10
Address 1: 10.96.0.10
nslookup: can't resolve 'google.com'
Is an nginx HTTP server required?
from the node the kube cluster is running on:
keyinsp@ubuntu:~/kes$ ping google.com
PING google.com (172.217.4.110) 56(84) bytes of data.
64 bytes from ord36s04-in-f14.1e100.net (172.217.4.110): icmp_seq=1 ttl=52 time=14.8 ms
64 bytes from ord36s04-in-f14.1e100.net (172.217.4.110): icmp_seq=2 ttl=52 time=14.9 ms
64 bytes from ord36s04-in-f14.1e100.net (172.217.4.110): icmp_seq=3 ttl=52 time=14.8 ms
64 bytes from ord36s04-in-f14.1e100.net (172.217.4.110): icmp_seq=4 ttl=52 time=14.9 ms
^C
--- google.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 14.804/14.896/14.978/0.108 ms
keyinsp@ubuntu:~/kes$ nslookup google.com
Server: 192.168.1.1
Address: 192.168.1.1#53
Non-authoritative answer:
Name: google.com
Address: 172.217.9.78
kube-dns is not working for you.

You can try something like:

```
kubectl -n kube-system get deployment,pods
```

and:

```
kubectl -n kube-system describe service kube-dns
```
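If the kube-dns pods show up as not ready or restarting, their container logs usually say why (container names taken from the deployment you pasted above; replace the pod name with a real one):

```sh
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs <kube-dns-pod-name> -c kubedns
kubectl -n kube-system logs <kube-dns-pod-name> -c dnsmasq
```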
Ok, I see the problem. It's stuck in a CrashLoopBackOff. I may have to kubeadm reset (again... this gave me great headaches last week). Here is the kubeadm init command I'm using (perhaps I need 1.8.4?):
kubeadm init --kubernetes-version v1.8.1 --token-ttl=0 --pod-network-cidr=10.244.0.0/16
--token-ttl=0 is so the join token never expires, and --pod-network-cidr is of course for Flannel. I also tried adding --apiserver-advertise-address=0.0.0.0, but that never worked.
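If I do end up resetting, this is roughly the sequence I'd expect to run (a sketch; it assumes the stock kube-flannel manifest from the coreos/flannel repo and that 1.8.4 is the version I actually want):

```sh
sudo kubeadm reset
sudo kubeadm init --kubernetes-version v1.8.4 --token-ttl=0 --pod-network-cidr=10.244.0.0/16

# Standard post-init kubeconfig setup
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Apply the Flannel network add-on
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
```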
Also, I have the following in the /etc/cni/net.d/ directory, which may be conflicting: 10-flannel.conf is generated when applying the --pod-network-cidr (from what I understand), and 10-kube.conf was a suggested production server conf from the Kubernetes GitHub:
keyinsp@ubuntu:~/kes$ cat /etc/cni/net.d/10-flannel.conf
{
"name": "cbr0",
"type": "flannel",
"delegate": {
"hairpinMode": true,
"isDefaultGateway": true
}
}
keyinsp@ubuntu:~/kes$ cat /etc/cni/net.d/10-kube.conf
{
"cniVersion": "0.3.1",
"name": "bridge",
"type": "bridge",
"bridge": "cni0",
"isGateway": true,
"ipMasq": true,
"ipam": {
"type": "host-local",
"subnet": "10.152.183.0/16"
},
"plugins": [
{
"name": "weave",
"type": "weave-net",
"hairpinMode": true
},
{
"type": "portmap",
"capabilities": {"portMappings": true},
"snat": true
}
]
}
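If those two configs really are in conflict, my plan would be to back one of them out, something like this (a sketch; it assumes kubelet picks up the lexically first file in /etc/cni/net.d and that Flannel is the network I actually want):

```sh
sudo mkdir -p /root/cni-backup
sudo mv /etc/cni/net.d/10-kube.conf /root/cni-backup/
sudo systemctl restart kubelet
```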
Glad you figured it out! I can't help with kubeadm, though. Good luck.
Thanks again for your help, Paulo. I will let you know if re-initializing fixes the issue and what the underlying solution is in case this comes up for someone else in the future.
Call me Pires. No problem. I will update the kubernetes-elasticsearch-cluster repo and if I find similar issues I'll re-open this.