upmc-enterprises/elasticsearch-operator

Master re-election delay (10+ seconds) and outage

nabadger opened this issue · 11 comments

This is actually something we can reproduce (and are struggling to understand) on a manually deployed ES cluster, but I've just tested it with your operator as well, and it exhibits the same issue.

Generally speaking, we would expect a delay of a few seconds during re-election, but when deploying ES on K8s we are seeing master re-elections take between 10 and 30 seconds.

Config:

apiVersion: enterprises.upmc.com/v1
kind: ElasticsearchCluster
metadata:
  name: example-es-cluster
spec:
  kibana:
    image: docker.elastic.co/kibana/kibana-oss:6.1.3
  cerebro:
    image: upmcenterprises/cerebro:0.6.8
  elastic-search-image: upmcenterprises/docker-elasticsearch-kubernetes:6.1.3_0
  client-node-replicas: 1
  master-node-replicas: 3
  data-node-replicas: 1
  network-host: 0.0.0.0
  zones: []
  data-volume-size: 10Gi
  java-options: "-Xms512m -Xmx512m"
  snapshot:
    scheduler-enabled: false
    bucket-name: elasticsnapshots99
    cron-schedule: "@every 2m"
    image: upmcenterprises/elasticsearch-cron:0.0.4
  storage:
    storage-class: do-block-storage
  resources:
    requests:
      memory: 512Mi
      cpu: 500m
    limits:
      memory: 1024Mi
      cpu: '1'

To test this, I have 3 masters, plus a single data node and a single client node.

If I watch the output of curl https://localhost:9200/_cat/nodes on the client or data node, I can monitor which node is the current master.
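
For reference, this is roughly the watch loop I use (a sketch; the exact curl flags depend on your TLS/auth setup - here -k just skips verification of the self-signed certs):

# Run from the client or data pod; the node marked with '*' in the master
# column of the output is the currently elected master.
watch -n 1 "curl -sk https://localhost:9200/_cat/nodes?v"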

If I kubectl exec into the elected master and run kill 1 (i.e. send SIGTERM to the Java process), it dies as expected, and master re-election typically completes in under 3 seconds.

If, however, I kill the pod of the running master (kubectl delete pod <podname>), we don't see any master re-election for typically 10 to 30 seconds. It actually looks like the cluster waits for the new pod to be created and for its health probes to succeed.

During this phase, the curl command will fail (since there's no elected master).
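
To be explicit, the two scenarios boil down to something like this (the pod name is a placeholder for whichever pod is currently master):

# Scenario 1: send SIGTERM to the Java process (PID 1) inside the elected
# master's pod - re-election completes in under ~3 seconds.
kubectl exec <elected-master-pod> -- kill 1

# Scenario 2: delete the elected master's pod - no new master appears for
# 10-30 seconds, seemingly until the replacement pod passes its probes.
kubectl delete pod <elected-master-pod>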

I suspect this is related to various timeouts (both K8s and ES), but I'm having a hard time figuring out which ones. We also think it might be related to DNS caching.

Do you know if this is expected behaviour?

These logs show the point at which I kill the current master via kubectl delete pod.

The master that I killed has IP 10.244.2.6

es-master-example-es-cluster-do-block-storage-0 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:11,958][INFO ][o.e.n.Node               ] [b4c04c50-36e7-4fc3-a19a-57b70c858f1e] stopping ...
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:11,975][INFO ][o.e.d.z.ZenDiscovery     ] [87fac673-05b4-41b3-9ded-6d2d474f1144] master_left [{b4c04c50-36e7-4fc3-a19a-57b70c858f1e}{JitkNoYERu61jvEvPES3OQ}{8KB5ZO1_TeabDn_EfuDUDw}{10.244.2.6}{10.244.2.6:9300}], reason [shut_down]
es-master-example-es-cluster-do-block-storage-1 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:11,979][INFO ][o.e.d.z.ZenDiscovery     ] [8c9da767-5993-4743-a852-9691301e05e2] master_left [{b4c04c50-36e7-4fc3-a19a-57b70c858f1e}{JitkNoYERu61jvEvPES3OQ}{8KB5ZO1_TeabDn_EfuDUDw}{10.244.2.6}{10.244.2.6:9300}], reason [shut_down]
es-master-example-es-cluster-do-block-storage-1 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:11,980][WARN ][o.e.d.z.ZenDiscovery     ] [8c9da767-5993-4743-a852-9691301e05e2] master left (reason = shut_down), current nodes: nodes:
es-master-example-es-cluster-do-block-storage-1 es-master-example-es-cluster-do-block-storage    {8c9da767-5993-4743-a852-9691301e05e2}{dAsF6oonSS6_5qhqL2SCIQ}{6Ncb_DQjS_mzkFB37Ay07A}{10.244.3.6}{10.244.3.6:9300}, local
es-master-example-es-cluster-do-block-storage-1 es-master-example-es-cluster-do-block-storage    {87fac673-05b4-41b3-9ded-6d2d474f1144}{wp9NynifTKuMiL4aDWpx2g}{RrA1DEOISM6fAi-EzCoa-Q}{10.244.1.7}{10.244.1.7:9300}
es-master-example-es-cluster-do-block-storage-1 es-master-example-es-cluster-do-block-storage    {1c4ae2db-cd0d-4596-b727-48f70c027618}{AwhuyO5uRFSZ4lGCKap1Rg}{c05RG-6rQ36oQNkUAfuASw}{10.244.1.6}{10.244.1.6:9300}
es-master-example-es-cluster-do-block-storage-1 es-master-example-es-cluster-do-block-storage    {b4c04c50-36e7-4fc3-a19a-57b70c858f1e}{JitkNoYERu61jvEvPES3OQ}{8KB5ZO1_TeabDn_EfuDUDw}{10.244.2.6}{10.244.2.6:9300}, master
es-master-example-es-cluster-do-block-storage-1 es-master-example-es-cluster-do-block-storage    {14fefd79-5f32-4492-b9b6-569f27bd9940}{bv2k3nZBTjmNsPxIIvp-hQ}{RjdWE-pnTMWFQET1H5iGoQ}{10.244.3.3}{10.244.3.3:9300}
es-master-example-es-cluster-do-block-storage-1 es-master-example-es-cluster-do-block-storage
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:11,975][WARN ][o.e.d.z.ZenDiscovery     ] [87fac673-05b4-41b3-9ded-6d2d474f1144] master left (reason = shut_down), current nodes: nodes:
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage    {14fefd79-5f32-4492-b9b6-569f27bd9940}{bv2k3nZBTjmNsPxIIvp-hQ}{RjdWE-pnTMWFQET1H5iGoQ}{10.244.3.3}{10.244.3.3:9300}
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage    {8c9da767-5993-4743-a852-9691301e05e2}{dAsF6oonSS6_5qhqL2SCIQ}{6Ncb_DQjS_mzkFB37Ay07A}{10.244.3.6}{10.244.3.6:9300}
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage    {1c4ae2db-cd0d-4596-b727-48f70c027618}{AwhuyO5uRFSZ4lGCKap1Rg}{c05RG-6rQ36oQNkUAfuASw}{10.244.1.6}{10.244.1.6:9300}
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage    {87fac673-05b4-41b3-9ded-6d2d474f1144}{wp9NynifTKuMiL4aDWpx2g}{RrA1DEOISM6fAi-EzCoa-Q}{10.244.1.7}{10.244.1.7:9300}, local
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage    {b4c04c50-36e7-4fc3-a19a-57b70c858f1e}{JitkNoYERu61jvEvPES3OQ}{8KB5ZO1_TeabDn_EfuDUDw}{10.244.2.6}{10.244.2.6:9300}, master
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage
es-master-example-es-cluster-do-block-storage-0 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:12,028][ERROR][c.a.e.s.StatsdService    ] [b4c04c50-36e7-4fc3-a19a-57b70c858f1e] Exiting StatsdReporterThread
es-master-example-es-cluster-do-block-storage-0 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:12,036][INFO ][c.a.e.s.StatsdService    ] [b4c04c50-36e7-4fc3-a19a-57b70c858f1e] StatsD reporter stopped
es-master-example-es-cluster-do-block-storage-0 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:12,037][INFO ][o.e.n.Node               ] [b4c04c50-36e7-4fc3-a19a-57b70c858f1e] stopped
es-master-example-es-cluster-do-block-storage-0 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:12,037][INFO ][o.e.n.Node               ] [b4c04c50-36e7-4fc3-a19a-57b70c858f1e] closing ...
es-master-example-es-cluster-do-block-storage-0 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:12,073][INFO ][o.e.n.Node               ] [b4c04c50-36e7-4fc3-a19a-57b70c858f1e] closed
es-master-example-es-cluster-do-block-storage-1 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:15,104][INFO ][o.e.c.s.MasterService    ] [8c9da767-5993-4743-a852-9691301e05e2] zen-disco-elected-as-master ([2] nodes joined)[{1c4ae2db-cd0d-4596-b727-48f70c027618}{AwhuyO5uRFSZ4lGCKap1Rg}{c05RG-6rQ36oQNkUAfuASw}{10.244.1.6}{10.244.1.6:9300}, {87fac673-05b4-41b3-9ded-6d2d474f1144}{wp9NynifTKuMiL4aDWpx2g}{RrA1DEOISM6fAi-EzCoa-Q}{10.244.1.7}{10.244.1.7:9300}], reason: new_master {8c9da767-5993-4743-a852-9691301e05e2}{dAsF6oonSS6_5qhqL2SCIQ}{6Ncb_DQjS_mzkFB37Ay07A}{10.244.3.6}{10.244.3.6:9300}
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:15,180][INFO ][o.e.c.s.ClusterApplierService] [87fac673-05b4-41b3-9ded-6d2d474f1144] detected_master {8c9da767-5993-4743-a852-9691301e05e2}{dAsF6oonSS6_5qhqL2SCIQ}{6Ncb_DQjS_mzkFB37Ay07A}{10.244.3.6}{10.244.3.6:9300}, reason: apply cluster state (from master [master {8c9da767-5993-4743-a852-9691301e05e2}{dAsF6oonSS6_5qhqL2SCIQ}{6Ncb_DQjS_mzkFB37Ay07A}{10.244.3.6}{10.244.3.6:9300} committed version [23]])
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:16,087][ERROR][c.f.s.s.t.SearchGuardSSLNettyTransport] [87fac673-05b4-41b3-9ded-6d2d474f1144] SSL Problem Received close_notify during handshake
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage javax.net.ssl.SSLException: Received close_notify during handshake
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at sun.security.ssl.Alerts.getSSLException(Alerts.java:208) ~[?:?]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1666) ~[?:?]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1634) ~[?:?]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at sun.security.ssl.SSLEngineImpl.recvAlert(SSLEngineImpl.java:1776) ~[?:?]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:1083) ~[?:?]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:907) ~[?:?]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781) ~[?:?]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624) ~[?:1.8.0_151]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:255) ~[netty-handler-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1162) ~[netty-handler-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1084) ~[netty-handler-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489) ~[netty-codec-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428) ~[netty-codec-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) ~[netty-codec-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.13.Final.jar:4.1.13.Final]
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]

- es-master-example-es-cluster-do-block-storage-0

...

# At this point the curl commands hang until the pod comes back

...

es-master-example-es-cluster-do-block-storage-0 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:51,302][INFO ][o.e.c.s.ClusterApplierService] [6864cebb-f4c0-42e8-a28b-48a214a6913c] detected_master {8c9da767-5993-4743-a852-9691301e05e2}{dAsF6oonSS6_5qhqL2SCIQ}{6Ncb_DQjS_mzkFB37Ay07A}{10.244.3.6}{10.244.3.6:9300}, added {{1c4ae2db-cd0d-4596-b727-48f70c027618}{AwhuyO5uRFSZ4lGCKap1Rg}{c05RG-6rQ36oQNkUAfuASw}{10.244.1.6}{10.244.1.6:9300},{14fefd79-5f32-4492-b9b6-569f27bd9940}{bv2k3nZBTjmNsPxIIvp-hQ}{RjdWE-pnTMWFQET1H5iGoQ}{10.244.3.3}{10.244.3.3:9300},{8c9da767-5993-4743-a852-9691301e05e2}{dAsF6oonSS6_5qhqL2SCIQ}{6Ncb_DQjS_mzkFB37Ay07A}{10.244.3.6}{10.244.3.6:9300},{87fac673-05b4-41b3-9ded-6d2d474f1144}{wp9NynifTKuMiL4aDWpx2g}{RrA1DEOISM6fAi-EzCoa-Q}{10.244.1.7}{10.244.1.7:9300},}, reason: apply cluster state (from master [master {8c9da767-5993-4743-a852-9691301e05e2}{dAsF6oonSS6_5qhqL2SCIQ}{6Ncb_DQjS_mzkFB37Ay07A}{10.244.3.6}{10.244.3.6:9300} committed version [25]])
es-master-example-es-cluster-do-block-storage-1 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:51,264][INFO ][o.e.c.s.ClusterApplierService] [8c9da767-5993-4743-a852-9691301e05e2] removed {{b4c04c50-36e7-4fc3-a19a-57b70c858f1e}{JitkNoYERu61jvEvPES3OQ}{8KB5ZO1_TeabDn_EfuDUDw}{10.244.2.6}{10.244.2.6:9300},}, reason: apply cluster state (from master [master {8c9da767-5993-4743-a852-9691301e05e2}{dAsF6oonSS6_5qhqL2SCIQ}{6Ncb_DQjS_mzkFB37Ay07A}{10.244.3.6}{10.244.3.6:9300} committed version [24] source [zen-disco-node-failed({b4c04c50-36e7-4fc3-a19a-57b70c858f1e}{JitkNoYERu61jvEvPES3OQ}{8KB5ZO1_TeabDn_EfuDUDw}{10.244.2.6}{10.244.2.6:9300}), reason(transport disconnected)[{b4c04c50-36e7-4fc3-a19a-57b70c858f1e}{JitkNoYERu61jvEvPES3OQ}{8KB5ZO1_TeabDn_EfuDUDw}{10.244.2.6}{10.244.2.6:9300} transport disconnected]]])
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:51,293][INFO ][o.e.c.s.ClusterApplierService] [87fac673-05b4-41b3-9ded-6d2d474f1144] added {{6864cebb-f4c0-42e8-a28b-48a214a6913c}{JitkNoYERu61jvEvPES3OQ}{guDOyzEATMq7Pe32p2xxMQ}{10.244.2.7}{10.244.2.7:9300},}, reason: apply cluster state (from master [master {8c9da767-5993-4743-a852-9691301e05e2}{dAsF6oonSS6_5qhqL2SCIQ}{6Ncb_DQjS_mzkFB37Ay07A}{10.244.3.6}{10.244.3.6:9300} committed version [25]])
es-master-example-es-cluster-do-block-storage-1 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:51,278][INFO ][o.e.c.s.MasterService    ] [8c9da767-5993-4743-a852-9691301e05e2] zen-disco-node-join[{14fefd79-5f32-4492-b9b6-569f27bd9940}{bv2k3nZBTjmNsPxIIvp-hQ}{RjdWE-pnTMWFQET1H5iGoQ}{10.244.3.3}{10.244.3.3:9300}, {6864cebb-f4c0-42e8-a28b-48a214a6913c}{JitkNoYERu61jvEvPES3OQ}{guDOyzEATMq7Pe32p2xxMQ}{10.244.2.7}{10.244.2.7:9300}], reason: added {{6864cebb-f4c0-42e8-a28b-48a214a6913c}{JitkNoYERu61jvEvPES3OQ}{guDOyzEATMq7Pe32p2xxMQ}{10.244.2.7}{10.244.2.7:9300},}
es-master-example-es-cluster-do-block-storage-0 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:52,157][INFO ][c.f.s.s.h.n.SearchGuardSSLNettyHttpServerTransport] [6864cebb-f4c0-42e8-a28b-48a214a6913c] publish_address {10.244.2.7:9200}, bound_addresses {[::]:9200}
es-master-example-es-cluster-do-block-storage-0 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:52,157][INFO ][o.e.n.Node               ] [6864cebb-f4c0-42e8-a28b-48a214a6913c] started
es-master-example-es-cluster-do-block-storage-1 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:52,262][INFO ][o.e.c.s.ClusterApplierService] [8c9da767-5993-4743-a852-9691301e05e2] added {{6864cebb-f4c0-42e8-a28b-48a214a6913c}{JitkNoYERu61jvEvPES3OQ}{guDOyzEATMq7Pe32p2xxMQ}{10.244.2.7}{10.244.2.7:9300},}, reason: apply cluster state (from master [master {8c9da767-5993-4743-a852-9691301e05e2}{dAsF6oonSS6_5qhqL2SCIQ}{6Ncb_DQjS_mzkFB37Ay07A}{10.244.3.6}{10.244.3.6:9300} committed version [25] source [zen-disco-node-join[{14fefd79-5f32-4492-b9b6-569f27bd9940}{bv2k3nZBTjmNsPxIIvp-hQ}{RjdWE-pnTMWFQET1H5iGoQ}{10.244.3.3}{10.244.3.3:9300}, {6864cebb-f4c0-42e8-a28b-48a214a6913c}{JitkNoYERu61jvEvPES3OQ}{guDOyzEATMq7Pe32p2xxMQ}{10.244.2.7}{10.244.2.7:9300}]]])
es-master-example-es-cluster-do-block-storage-2 es-master-example-es-cluster-do-block-storage [2018-09-01T16:24:52,278][WARN ][o.e.t.TransportService   ] [87fac673-05b4-41b3-9ded-6d2d474f1144] Received response for a request that has timed out, sent [36098ms] ago, timed out [6098ms] ago, action [internal:discovery/zen/fd/master_ping], node [{8c9da767-5993-4743-a852-9691301e05e2}{dAsF6oonSS6_5qhqL2SCIQ}{6Ncb_DQjS_mzkFB37Ay07A}{10.244.3.6}{10.244.3.6:9300}], id [2165]

It looks like it did elect a new master (10.244.3.6), but I'm unsure why the curl command fails - perhaps that suggests DNS/k8s routing issues?

Hm, might be the same as #228

Is it possible to confirm whether or not this is reproducible on your stack?

We believe we have found and fixed the issue - the key for us was to trap SIGTERM and wait for ES to complete the failover before exiting. If this doesn't happen, Kubernetes kills the pod (and its network stack) too quickly, and ES can't complete the failover.

Our example: https://github.com/mintel/es-image/blob/master/run.sh
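
For anyone who doesn't want to click through, the core of the fix is a trap-and-wait wrapper along these lines (a minimal sketch of the idea, not the exact run.sh - the elasticsearch binary path is illustrative):

#!/bin/sh
# Sketch of a SIGTERM-aware entrypoint: forward the signal to Elasticsearch
# and block until it has shut down cleanly, so it can tell the other masters
# it is leaving before Kubernetes tears down the pod's network namespace.

term_handler() {
  if [ -n "$ES_PID" ]; then
    kill -TERM "$ES_PID"
    wait "$ES_PID"
  fi
  exit 143   # 128 + 15 (SIGTERM)
}

trap term_handler TERM

# Start Elasticsearch in the background so this wrapper stays PID 1 and keeps
# receiving signals from Kubernetes.
/elasticsearch/bin/elasticsearch "$@" &
ES_PID=$!

# Block until Elasticsearch exits on its own or the trap above fires.
wait "$ES_PID"

The important part is the wait after forwarding the signal: the wrapper must not exit until Elasticsearch has finished shutting down, otherwise the pod is torn down before the remaining masters have processed the departure.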

Hey @nabadger sorry for the delay in responding. I haven't looked into this just yet.

@stevesloka no problem. We've worked around it for now with our own ES deployment manifests, but I'd much rather use this operator 👍

If you're willing we can work to apply those fixes to the operator.

@stevesloka I had a look at the Dockerfile that this operator is using. I didn't realise until now, but I see you're using https://github.com/pires/docker-elasticsearch via https://github.com/upmc-enterprises/docker-elasticsearch-kubernetes

I've actually raised this same issue on the pires repo too.

The fix for this should be in the run.sh entrypoint here: https://github.com/pires/docker-elasticsearch/blob/master/run.sh

We actually based our version on this, so perhaps I can just submit a PR to pires/docker-elasticsearch, and this operator should then pick up the fix :)

@nabadger actually we had some PRs from @while1eq1, who made up a new image to fix some of the SearchGuard updates. Could you look to see if that image needs the same fix (elastic-search-image: quay.io/while1eq1/elasticsearch-kubernetes-searchguard)? It would be good to fix the pires repo as well, just to help out others.

@stevesloka thanks, it looks like it would have the same issue.

I'll test it and submit a PR to that repo if I can reproduce it.

Note, this issue has now been recognized on the main Helm chart repo for Elasticsearch.

helm/charts#8785

We should monitor that and ideally use the same solution for all Elasticsearch installs on Kubernetes.

That might end up being a fix in https://github.com/elastic/elasticsearch-docker/blob/master/build/elasticsearch/bin/docker-entrypoint.sh (or something that wraps it), if the idea is to fix this in the entrypoint (like I did).