Re-election of master causes ~40s cluster outage
erkolson opened this issue · 6 comments
The Elasticsearch team indicates (in issues/discussions) that a master re-election should only cause a ~5 second outage of the cluster. In practice, I am consistently seeing 40-55 second outages. It could possibly be related to #105?
By outage, I mean that clients time out on any call, whether it be _cluster/health or even _cat/indices.
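For reference, this is roughly how I'm observing the outage while killing the master (a minimal sketch; the localhost:9200 endpoint and the 5s client-side timeout are just assumptions for illustration):

# poll the cluster once a second; during the re-election both calls hang
# until the client-side --max-time fires
while true; do
  date
  curl -s --max-time 5 'localhost:9200/_cluster/health?pretty' || echo 'health call timed out'
  curl -s --max-time 5 'localhost:9200/_cat/indices'           || echo '_cat/indices timed out'
  sleep 1
done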
From looking at the logs, it appears that the remaining master-eligible nodes notice that the current master has left the cluster and quickly elect a new master, then nothing happens for >30s. It appears the cluster is waiting for a response from the now-dead master. When the dead master is finally removed (the zen-disco-node-failed log message), everything works again.
This feels like a misconfiguration of the masters.
I've replicated the issue in both 5.x and 6.x.
I just gave it a try with NETWORK_HOST set to _eth0:ipv4_ on the masters and still see the same behavior.
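For completeness, this is the kind of change I'm talking about (a sketch; the es-master deployment name is an assumption based on my pod names, and it assumes the image maps NETWORK_HOST into network.host in elasticsearch.yml):

# set the bind/publish host on the master deployment and let the pods roll
kubectl set env deployment/es-master NETWORK_HOST=_eth0:ipv4_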
Can you try the latest 6.3.1 images (this repo has yet to be updated) and report back? Want to make sure this isn't a version thing.
Also, can you reproduce this all the time? Have you tried different Kubernetes clusters?
I'm seeing the same behavior with 6.3.1.
So far, all of my testing has been in GKE, but I've seen this behavior across several versions. I have Docker Edge, so I can try with the Kubernetes it provides if I have enough resources on my workstation.
After a lot of troubleshooting today, I got the outage down to 14s. With help from a colleague who knows ES better than I do, we did a lot of testing and log reading and adjusted the various timeouts available. From the logs, it looks like the cluster is waiting for an acknowledgement of the new cluster state from the dead master. Changing the timeouts on the masters wasn't enough; I eventually changed them on the clients and data nodes as well.
The image I'm using is based on yours; I've added some more configuration to elasticsearch.yml:
discovery:
  zen:
    ping.unicast.hosts: ${DISCOVERY_SERVICE}
    minimum_master_nodes: ${NUMBER_OF_MASTERS}
    commit_timeout: ${ZEN_COMMIT_TIMEOUT}
    publish_timeout: ${ZEN_PUBLISH_TIMEOUT}
    fd:
      ping_interval: ${FD_PING_INTERVAL}
      ping_timeout: ${FD_PING_TIMEOUT}
      ping_retries: ${FD_PING_RETRIES}
And these are the values from the test run where the outage was down to 14s:
- name: ZEN_COMMIT_TIMEOUT
  value: "5s"
- name: ZEN_PUBLISH_TIMEOUT
  value: "5s"
- name: FD_PING_TIMEOUT
  value: "1s"
- name: FD_PING_INTERVAL
  value: "1s"
- name: FD_PING_RETRIES
  value: "3"
These timeout values are all defaulted to 30s, I believe.
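To sanity-check that the overrides actually made it onto every node (masters, clients, and data nodes), something like this should list the discovery settings each node booted with (a sketch; the endpoint and filter_path are assumptions):

# settings explicitly set on each node; defaults are not included, so the
# overridden timeouts should show up here for every node
curl -s 'localhost:9200/_nodes/settings?filter_path=nodes.*.settings.discovery&pretty'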
In all my testing, this is the log message that coincides with the cluster becoming available again:
[2018-07-13T01:06:03,056][INFO ][o.e.c.s.MasterService ] [es-master-85c765b64b-zjxzv] zen-disco-node-failed({es-master-85c765b64b-w84n9}{EdRMiBJ1S3Wbc8YZRRQ7jg}{BN1tySSWQeyjqoSDqZlj7Q}{10.12.4.30}{10.12.4.30:9300}{xpack.installed=true}), reason(transport disconnected), reason: removed {{es-master-85c765b64b-w84n9}{EdRMiBJ1S3Wbc8YZRRQ7jg}{BN1tySSWQeyjqoSDqZlj7Q}{10.12.4.30}{10.12.4.30:9300}{xpack.installed=true},}
As I mentioned earlier, the only sense I can make of it is that the cluster nodes are waiting for the dead master to confirm the new cluster state. Once it is removed, everything works again as all members have committed the new state.
@chrislovecnm asked me to add him here.
After much troubleshooting, I realized the actual problem is the query I am running. The documentation says "reads" should succeed while the master is being elected, and I assumed that a GET on /_cat/indices would be a "read". That call fails for the duration I have described in this issue, but a search query like /<index-name>/_search?q=some%20text works throughout the entire master election & cluster state update/publish process.
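To illustrate, these are the two kinds of calls I was comparing (a sketch; the index name and query text are placeholders):

# times out for the full duration described above (it appears to need the
# updated cluster state from the master):
curl -s --max-time 10 'localhost:9200/_cat/indices'

# keeps working throughout the election, since the search is answered by the
# data nodes using the last known cluster state:
curl -s --max-time 10 'localhost:9200/my-index/_search?q=some%20text'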