Re-election of master causes ~40s cluster outage
erkolson opened this issue · 6 comments
The Elasticsearch team indicates (in issues/discussions) that a master re-election should only cause a ~5 second outage of the cluster. In practice, I am consistently seeing 40-55 second outages. It could possibly be related to #105?
By outage, I mean that clients time out on any call, whether it be _cluster/health or even _cat/indices.
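For reference, this is roughly how I'm observing the outage while killing the master (a minimal sketch; the localhost:9200 endpoint and the 5s client-side timeout are just assumptions for illustration):

# poll the cluster once a second; during the re-election both calls hang
# until the client-side --max-time fires
while true; do
  date
  curl -s --max-time 5 'localhost:9200/_cluster/health?pretty' || echo 'health call timed out'
  curl -s --max-time 5 'localhost:9200/_cat/indices'           || echo '_cat/indices timed out'
  sleep 1
done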
From looking at the logs, it appears that the remaining master-eligible nodes notice that the current master has left the cluster and quickly elect a new master, then nothing happens for >30s. It appears the cluster is waiting for a response from the now-dead master. When the dead master is finally removed (the zen-disco-node-failed log message), everything works again.
This feels like a misconfiguration of the masters.
I've replicated the issue in both 5.x and 6.x.
I just gave it a try with NETWORK_HOST set to _eth0:ipv4_ on the masters and still see the same behavior.
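For completeness, this is the kind of change I'm talking about (a sketch; the es-master deployment name is an assumption based on my pod names, and it assumes the image maps NETWORK_HOST into network.host in elasticsearch.yml):

# set the bind/publish host on the master deployment and let the pods roll
kubectl set env deployment/es-master NETWORK_HOST=_eth0:ipv4_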
Can you try the latest 6.3.1 images (this repo has yet to be updated) and report back? Want to make sure this isn't a version thing.
Also, can you reproduce this all the time? Have you tried different Kubernetes clusters?
I'm seeing the same behavior with 6.3.1.
So far, all of my testing has been in GKE, but I've seen this behavior across several versions. I have Docker Edge, so I can try with the Kubernetes it provides if I have enough resources on my workstation.
After a lot of troubleshooting today, I got the outage down to 14s. With help from a colleague who knows ES better than I do, we did a lot of testing and log reading and adjusted the various timeouts available. From the logs, it looks like the cluster is waiting for an acknowledgement of the new cluster state from the dead master. Changing the timeouts on the masters wasn't enough; I eventually changed them on the clients and data nodes as well.
The image I'm using is based on yours; I've added some more configuration to elasticsearch.yml:
discovery:
  zen:
    ping.unicast.hosts: ${DISCOVERY_SERVICE}
    minimum_master_nodes: ${NUMBER_OF_MASTERS}
    commit_timeout: ${ZEN_COMMIT_TIMEOUT}
    publish_timeout: ${ZEN_PUBLISH_TIMEOUT}
    fd:
      ping_interval: ${FD_PING_INTERVAL}
      ping_timeout: ${FD_PING_TIMEOUT}
      ping_retries: ${FD_PING_RETRIES}
And these are the values from the test run where the outage was down to 14s:
- name: ZEN_COMMIT_TIMEOUT
  value: "5s"
- name: ZEN_PUBLISH_TIMEOUT
  value: "5s"
- name: FD_PING_TIMEOUT
  value: "1s"
- name: FD_PING_INTERVAL
  value: "1s"
- name: FD_PING_RETRIES
  value: "3"
These timeout values are all defaulted to 30s, I believe.
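To sanity-check that the overrides actually made it onto every node (masters, clients, and data nodes), something like this should list the discovery settings each node booted with (a sketch; the endpoint and filter_path are assumptions):

# settings explicitly set on each node; defaults are not included, so the
# overridden timeouts should show up here for every node
curl -s 'localhost:9200/_nodes/settings?filter_path=nodes.*.settings.discovery&pretty'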
In all my testing, this is the log message that coincides with the cluster becoming available again:
[2018-07-13T01:06:03,056][INFO ][o.e.c.s.MasterService ] [es-master-85c765b64b-zjxzv] zen-disco-node-failed({es-master-85c765b64b-w84n9}{EdRMiBJ1S3Wbc8YZRRQ7jg}{BN1tySSWQeyjqoSDqZlj7Q}{10.12.4.30}{10.12.4.30:9300}{xpack.installed=true}), reason(transport disconnected), reason: removed {{es-master-85c765b64b-w84n9}{EdRMiBJ1S3Wbc8YZRRQ7jg}{BN1tySSWQeyjqoSDqZlj7Q}{10.12.4.30}{10.12.4.30:9300}{xpack.installed=true},}
As I mentioned earlier, the only sense I can make of it is that the cluster nodes are waiting for the dead master to confirm the new cluster state. Once it is removed, everything works again as all members have committed the new state.
@chrislovecnm asked me to add him here.
After much troubleshooting, I realized the actual problem is the query I am running. The documentation says "reads" should succeed while the master is being elected, and I assumed that a GET on /_cat/indices would be a "read". That call fails for the duration I have described in this issue, but a search query like /<index-name>/_search?q=some%20text works throughout the entire master election & cluster state update/publish process.
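To illustrate, these are the two kinds of calls I was comparing (a sketch; the index name and query text are placeholders):

# times out for the full duration described above (it appears to need the
# updated cluster state from the master):
curl -s --max-time 10 'localhost:9200/_cat/indices'

# keeps working throughout the election, since the search is answered by the
# data nodes using the last known cluster state:
curl -s --max-time 10 'localhost:9200/my-index/_search?q=some%20text'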