mantl/mesos-consul

mesos-consul keep on connecting to terminated mesos-agents, causing high latency in updating records to consul

rahul9999 opened this issue · 0 comments

Our environment is on aws ec2 with mesos cluster size of 150 nodes. Nodes scaling is managed by aws autoscaling. We are using latest version of mesos-consul 0.4.0.

We are seeing issue with mesos-consul when a node is terminated (by ASG or by other operational activities)-

  • mesos-consul try to connect to the terminated node for de-registration until its timeout (timeout is like 30 secs) and it keeps on doing so in all upcoming runs. This begin the latency problem in updates to Consul. And at some stage when the queue of such nodes will grow huge (due to autoscaling terminations, genuine node failure or due to operational activities) then latency becomes worse due to wait for each such node connection timeout.

Workaround is re-start mesos-consul which then re-build consul data from scratch (not a desired solution).
meso-consul-deregistrationerror-timeouts.txt

Example logs (also in attachment):
time="2018-01-12T12:50:25Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:tvault-ccp:172.xx.xx.2:11061: dial tcp 110.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:50:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:kafka:172.xx.xx.6:9092"
time="2018-01-12T12:50:55Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:kafka:172.xx.xx.6:9092: dial tcp 10.65.xx.38:8500: i/o timeout"
time="2018-01-12T12:50:55Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:kibana-blackbox:172.xx.xx.5:42973"
time="2018-01-12T12:51:25Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:kibana-blackbox:172.xx.xx.5:42973: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:51:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:tvault:172.xx.xx.3:49910"
time="2018-01-12T12:51:55Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:tvault:172.xx.xx.3:49910: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:51:55Z" level=info msg="Deregistering mesos-consul:mesos:6d3301d5-2bf1-439f-979d-b5ec5eb49312-S75:10.65.44.17"
time="2018-01-12T12:52:25Z" level=info msg="Deregistration error Put http://10.xx.xx.17:8500/v1/agent/service/deregister/mesos-consul:mesos:6d3301d5-2bf1-439f-979d-b5ec5eb49312-S75:10.65.44.17: dial tcp 10.xx.xx.17:8500: i/o timeout"
time="2018-01-12T12:52:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:kafka:172.xx.xx.6:9094"
time="2018-01-12T12:52:55Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:kafka:172.xx.xx.6:9094: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:52:55Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:tvault:172.xx.xx.3:49911"
time="2018-01-12T12:53:25Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:tvault:172.xx.xx.3:49911: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:53:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:tvault-ccp:172.xx.xx.2:11059"
time="2018-01-12T12:53:55Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:tvault-ccp:172.xx.xx.2:11059: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:53:55Z" level=info msg="Deregistering mesos-consul:10.xx.xx.11:uat-dcp-api-profile-uat-dcp:172.xx.xx.13:25553"
time="2018-01-12T12:53:55Z" level=info msg="Deregistering mesos-consul:mesos:e809550e-b1d9-44ae-8636-bf21c2d06e8a-S29:10.xx.xx.38"
time="2018-01-12T12:54:25Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:mesos:e809550e-b1d9-44ae-8636-bf21c2d06e8a-S29:10.65.228.38: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:54:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.33:msg-ui:172.xx.xx.11:61067"
time="2018-01-12T12:54:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.17:telemetry:172.xx.xx.6:31829"
time="2018-01-12T12:54:55Z" level=info msg="Deregistration error Put http://10.xx.xx.17:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.17:telemetry:172.xx.xx.6:31829: dial tcp 10.xx.xx.17:8500: i/o timeout"
time="2018-01-12T12:54:55Z" level=info msg="Zookeeper leader: 10.xx.xx.202:5050"
time="2018-01-12T12:54:55Z" level=info msg="reloading from master 10.xx.xx.202"
time="2018-01-12T12:54:55Z" level=info msg="Running parseState"
time="2018-01-12T12:54:55Z" level=info msg="Host found. Comparing tags: ([agent follower], [agent follower])"
time="2018-01-12T12:54:55Z" level=info msg="Host found. Comparing tags: ([agent follower], [agent follower])"

We are requesting a configuration parameter like 'max-attempts for failed nodes' which can be used to control the connection attempt count and when that count is reached mesos-consul should remove the failed node from its memory i.e should never ever connect again.

-Rahul