docker-library/elasticsearch

ES fails to connect to master node after IP address changed

ChristianWeissCG opened this issue · 1 comment

After I changed the IPs of all nodes in my ES cluster (4 VMs, each running the Docker image for v6.6.1), the cluster is still trying to reach the master node on its old IP.

Both on the host OS and inside the container I can resolve the FQDN (e.g. with dig and ping) and get the correct/new IP (new: 10.60.7.40; old: 10.3.2.37).

I use FQDNs in config/elasticsearch.yml:

discovery.zen.ping.unicast.hosts: ['cgbsel1.my.internal', 'cgbsel2.my.internal', 'cgbsel3.my.internal', 'cgbsel4.my.internal']

Even setting:

-Des.networkaddress.cache.ttl=1
-Des.networkaddress.cache.negative.ttl=1

in config/jvm.options did not make ES forget the old IP.
(BTW: I confirmed that this file is loaded by ES by setting an intentionally invalid value for both options, which produced the expected error.)
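
(For context, as far as I understand these options control the JVM-level InetAddress cache by setting the networkaddress.cache.ttl / networkaddress.cache.negative.ttl security properties. A minimal standalone sketch of that mechanism, outside of ES, using my master's FQDN as an example host:)

import java.net.InetAddress;
import java.security.Security;

// Standalone sketch (not ES code): the JVM caches DNS lookups made via InetAddress,
// and the cache TTL is read from the "networkaddress.cache.ttl" security property
// the first time InetAddress is used.
public class DnsCacheCheck {
    public static void main(String[] args) throws Exception {
        // Must be set before the first lookup, otherwise the default caching policy sticks.
        Security.setProperty("networkaddress.cache.ttl", "1");
        Security.setProperty("networkaddress.cache.negative.ttl", "1");

        String host = args.length > 0 ? args[0] : "cgbsel1.my.internal"; // the master's FQDN from my config

        System.out.println("first lookup:  " + InetAddress.getByName(host).getHostAddress());
        Thread.sleep(2_000); // longer than the 1s TTL, so the second lookup re-resolves
        System.out.println("second lookup: " + InetAddress.getByName(host).getHostAddress());
    }
}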

Log of node 3 (hostname: cgbsel3) while trying to connect to node 1 (master, hostname: cgbsel1):

...
[2020-04-14T11:34:56,706][INFO ][o.e.n.Node               ] [cgbsel3] initialized
[2020-04-14T11:34:56,706][INFO ][o.e.n.Node               ] [cgbsel3] starting ...
[2020-04-14T11:34:56,939][INFO ][o.e.t.TransportService   ] [cgbsel3] publish_address {10.60.7.42:9300}, bound_addresses {0.0.0.0:9300}
[2020-04-14T11:34:57,097][INFO ][o.e.b.BootstrapChecks    ] [cgbsel3] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2020-04-14T11:35:27,155][WARN ][o.e.n.Node               ] [cgbsel3] timed out while waiting for initial discovery state - timeout: 30s
[2020-04-14T11:35:27,171][INFO ][o.e.h.n.Netty4HttpServerTransport] [cgbsel3] publish_address {10.60.7.42:9200}, bound_addresses {0.0.0.0:9200}
[2020-04-14T11:35:27,171][INFO ][o.e.n.Node               ] [cgbsel3] started
[2020-04-14T11:35:30,213][WARN ][o.e.d.z.ZenDiscovery     ] [cgbsel3] failed to connect to master [{cgbsel1}{HASH_REMOVED}{HASH_REMOVED}{10.3.2.37}{10.3.2.37:9300}{ml.machine_memory=16657203200, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], retrying...
org.elasticsearch.transport.ConnectTransportException: [cgbsel1][10.3.2.37:9300] connect_timeout[30s]
	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1576) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.6.1.jar:6.6.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
[2020-04-14T11:36:03,243][WARN ][o.e.d.z.ZenDiscovery     ] [cgbsel3] failed to connect to master [{cgbsel1}{HASH_REMOVED}{HASH_REMOVED}{10.3.2.37}{10.3.2.37:9300}{ml.machine_memory=16657203200, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], retrying...
org.elasticsearch.transport.ConnectTransportException: [cgbsel1][10.3.2.37:9300] connect_timeout[30s]
	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1576) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.6.1.jar:6.6.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
[2020-04-14T11:36:25,785][WARN ][r.suppressed             ] [cgbsel3] path: /graylog_*/_alias, params: {expand_wildcards=open, index=graylog_*}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.6.1.jar:6.6.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
[2020-04-14T11:36:36,267][WARN ][o.e.d.z.ZenDiscovery     ] [cgbsel3] failed to connect to master [{cgbsel1}{HASH_REMOVED}{HASH_REMOVED}{10.3.2.37}{10.3.2.37:9300}{ml.machine_memory=16657203200, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], retrying...
org.elasticsearch.transport.ConnectTransportException: [cgbsel1][10.3.2.37:9300] connect_timeout[30s]
	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1576) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.6.1.jar:6.6.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
...

Why is ES still trying to use the old IP? Why does the old IP survive a reboot, and where is it persisted?
How can I flush or replace the old IP in ES?

Closing with elastic/elasticsearch#55161 (comment)

These IP addresses are not persisted across a reboot by Elasticsearch; they are re-resolved every time we attempt to connect or re-connect to the master. So if you're still seeing these names resolve to the old IP addresses, I suspect you have a caching layer elsewhere (a caching resolver on the host, or elsewhere in your DNS infrastructure) that is holding on to the stale addresses.
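
(A rough way to narrow down where the stale answer comes from, assuming a JDK is available inside the container, is to ask a JVM in the same environment which addresses it resolves the cluster hostnames to, since that is close to what Elasticsearch sees; a sketch:)

import java.net.InetAddress;

// Rough diagnostic (assumes a JDK inside the container): print every address the
// JVM's resolver returns for each cluster hostname. If the old IP still shows up
// here, the stale answer comes from the resolver/DNS layer, not from Elasticsearch.
public class ResolveClusterHosts {
    public static void main(String[] args) throws Exception {
        String[] hosts = args.length > 0 ? args : new String[] {
            "cgbsel1.my.internal", "cgbsel2.my.internal",
            "cgbsel3.my.internal", "cgbsel4.my.internal"
        };
        for (String host : hosts) {
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                System.out.println(host + " -> " + addr.getHostAddress());
            }
        }
    }
}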

Since this appears to be an environmental issue and not a bug in Elasticsearch, I'm going to close this issue. If you need additional assistance, please use the forums since we reserve GitHub for verified bug reports and feature requests. If it turns out there is a reproducible bug here, we can reopen this issue.