mantl/mesos-consul

mesos-consul acts up when mesos-agent is not responding

imrangit opened this issue · 26 comments

I have a big mesos cluster with 200 nodes participating.

I am seeing issues with mesos-consul when one or more mesos-slaves are not working:

  1. It doesn't de-register those mesos-slaves.
  2. And since some of the slaves are not responding, it doesn't update the services properly in consul. For example: when I scale apps down in marathon, it doesn't update the members for that consul service.

The only way I am able to get it working is to first resolve the issue with the mesos-slave that it thinks is part of the cluster, and then restart mesos-consul. Only then does it update the members properly from marathon to consul.

I think we need to make mesos-consul more robust to handle these scenarios for running in production. The consul services are the source of truth, and if they do not reflect the right members from marathon, it defeats the whole purpose of using mesos-consul.

-Imran

Is the host running the mesos agent still available, or is the entire node unreachable? mesos-consul connects to the consul agent running on the same server as the mesos agent, and if it can't reach that agent, the service won't be deregistered.
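For context, a minimal sketch of that per-node interaction using the hashicorp/consul/api client; the addresses and service names below are illustrative, not the actual mesos-consul code:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// mesos-consul talks to the consul agent on the same host as the mesos
	// agent (assumed here to listen on port 8500).
	cfg := api.DefaultConfig()
	cfg.Address = fmt.Sprintf("%s:8500", "10.3.161.158") // hypothetical node
	client, err := api.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Registration and deregistration both go through this node-local agent
	// endpoint, so if the node is unreachable neither can happen.
	err = client.Agent().ServiceRegister(&api.AgentServiceRegistration{
		ID:      "mesos-consul:10.3.161.158:logstash-ypec:31972",
		Name:    "logstash-ypec",
		Port:    31972,
		Address: "10.3.161.158",
	})
	if err != nil {
		log.Printf("register failed (agent unreachable?): %v", err)
	}
}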

Well, the entire node was unreachable because of a load issue. That brings up an important point: if any mesos-slave (consul agent) machine in the cluster becomes unresponsive, is mesos-consul going to break?
I think mesos-consul should be more resilient, as this is going to be the source of truth for service discovery for all the apps. The entire system shouldn't break if one of the nodes goes down.

-Imran

It currently can't deregister that node's services if the consul agent is down. Are you seeing a lot of deregistration errors in mesos-consul's logs?

Yes, I was seeing a lot of deregister messages. For some hosts I was also seeing a bunch of "no route to host" as well as "i/o" errors.

What I am saying is: if there are errors like that for only some hosts, why does mesos-consul not update the member list correctly when I scale applications up/down in marathon?

-Imran

And it wasn't that the apps were scaled on the affected nodes. They were getting scaled up and down on other hosts, but mesos-consul was still failing to update the member list in consul.

-Imran

Those errors are what I would expect since it can't connect to Consul on the down node.

mesos-consul currently can't update the nodes of the service because, in order to do so, it needs to deregister the service from the down node. And since it can't connect, the service is never deregistered.

I am going to look in to using the Consul catalog endpoint to deregister the service if the agent endpoint fails to deregister it. mesos-consul doesn't use the catalog endpoint normally because services deregistered via the catalog will be re-registered by Consul's anti-entropy process.
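A rough sketch of that fallback idea with the hashicorp/consul/api client (this is not the actual mesos-consul implementation, just an illustration of the two endpoints involved):

package sketch

import "github.com/hashicorp/consul/api"

// tryDeregister asks the consul agent on the service's node to deregister the
// service first; if that agent is unreachable it falls back to the catalog
// endpoint through any reachable agent. As noted above, a catalog
// deregistration can be undone by Consul's anti-entropy sync once the down
// agent comes back with the service still in its local state.
func tryDeregister(local, anyReachable *api.Client, node, address, serviceID string) error {
	if err := local.Agent().ServiceDeregister(serviceID); err == nil {
		return nil
	}
	_, err := anyReachable.Catalog().Deregister(&api.CatalogDeregistration{
		Node:      node,      // consul node name, e.g. "ms13.ev1"
		Address:   address,   // that node's address
		ServiceID: serviceID, // e.g. "mesos-consul:10.3.161.158:logstash-ypec:31972"
	}, nil)
	return err
}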

Hi Chris,
So what I think should happen is that mesos-consul should periodically verify whether the Marathon member list differs from the Consul service list. If it does, it should remove the stale members from the catalog. While the node is down it is not going to re-register that service anyway. But the issue comes when the down node is back up: it is going to re-register a service which is no longer running there. How can we clean that up?

The other option is for mesos-consul not to register services through the node agents at all. Rather, it would register each service through the catalog endpoint so that it is treated as an external service. Then we wouldn't have to run consul agents on every mesos agent, and the only job of mesos-consul would be to read the marathon event bus and update the member list in consul, something like how Bamboo/HAProxy (https://github.com/QubitProducts/bamboo) or mesos-dns (https://github.com/mesosphere/mesos-dns) do it.
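To make that second idea concrete, here is a rough sketch of what registering through the catalog endpoint might look like with the hashicorp/consul/api client; node names and addresses are illustrative, and the next comment explains why this approach runs into Consul's anti-entropy sync:

package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Any reachable consul agent or server will do for catalog writes.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register a marathon task directly in the catalog, without a consul
	// agent on the mesos agent itself.
	_, err = client.Catalog().Register(&api.CatalogRegistration{
		Node:    "ms13.ev1",     // hypothetical node name
		Address: "10.3.161.158", // mesos agent address
		Service: &api.AgentService{
			ID:      "mesos-consul:10.3.161.158:logstash-ypec:31972",
			Service: "logstash-ypec",
			Port:    31972,
			Address: "10.3.161.158",
		},
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
}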

@imrangit Managing the services through the catalog endpoint doesn't work. They need to be registered via an agent otherwise they get removed via Consul's anti-entropy process.

The problem here is that your node isn't actually down. The consul cluster doesn't seem to recognize the failing node as gone. If it did, consul would have removed the services from the catalog. mesos-consul would still report that the deregistration failed, but the agent's services would no longer be in the catalog.

If you run consul members on a non-failed node, does it still report the failed agent as alive? If so, consul force-leave <failed-node> "should" remove the node and clean up the services.
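For reference, the same check via the Go API (equivalent to running consul members and consul force-leave by hand); the node name passed to ForceLeave is just an example taken from the logs later in this thread:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List serf members as a non-failed agent sees them. Status 1 means
	// "alive"; a wedged node may still show up as alive here, which is why
	// its services never leave the catalog.
	members, err := client.Agent().Members(false)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range members {
		fmt.Printf("%s %s status=%d\n", m.Name, m.Addr, m.Status)
	}

	// Equivalent of `consul force-leave <failed-node>`.
	if err := client.Agent().ForceLeave("ms100.ev1"); err != nil {
		log.Fatal(err)
	}
}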

@ChrisAubuchon so let's start from scratch.

  1. I have an app in marathon called "myapp" running 5 instances.
  2. When I first ran mesos-consul, it properly synced the consul service members.
  3. Over a period of time, I scaled "myapp" up and down and restarted it in marathon. I still have only 5 instances of it running in marathon.
  4. However, the consul service now reports 20 members as opposed to 5. What I have noticed is that I have duplicate entries for the same nodes in the consul service. So either the consul agents on the old nodes didn't remove the service (because they went down or whatever happened to them), or, if they are not down, they are re-registering it again.

Now my concern is how we can fix this if we keep doing service registration through the consul agents. In the end, what I want is for consul to reflect the right members as they are in Marathon.

Also, one thing I have noticed is that when I look at the STDOUT of mesos-consul, I see a bunch of "Unable to register" messages for the failed nodes. When I ran "consul members", the failed node was reported as "failed". And when I did a "force-leave" of that failed node, my members were still out of sync. Somehow mesos-consul is not able to pick up the changes and do the right thing.

Let me know if you need more information to troubleshoot this.

Thanks,
Imran

Out of curiosity, what version of mesos-consul are you running?

Also, can you get the output of /v1/catalog/service/<service_with_too_many_members>? The ServiceID field should look like mesos-consul:<ip_address>:<service>:<port>. The IP address portion is the address that mesos-consul will use to deregister. Is that the address of the agent? The no route to host deregistration error is making me think that it is not...
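If it helps, the same data can be pulled with a short Go snippet against the catalog endpoint; the service name is the one from this thread, and the rest is standard hashicorp/consul/api usage:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Equivalent of GET /v1/catalog/service/logstash-ypec.
	services, _, err := client.Catalog().Service("logstash-ypec", "", nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, s := range services {
		// The middle part of the ServiceID is the address mesos-consul
		// will use when it tries to deregister the service.
		fmt.Printf("node=%s addr=%s id=%s port=%d\n",
			s.Node, s.Address, s.ServiceID, s.ServicePort)
	}
}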

mesos-consul: 0.3.1
consul: 0.6
marathon: 0.15.3

Here is the catalog output for that service. It contains 11 members as of now, but my actual member count in marathon is only 5.
[{"Node":"ms13.ev1","Address":"10.3.161.158","ServiceID":"mesos-consul:10.3.161.158:logstash-ypec:31972","ServiceName":"logstash-ypec","ServiceTags":[],"ServiceAddress":"10.3.161.158","ServicePort":31972,"ServiceEnableTagOverride":false,"CreateIndex":899264,"ModifyIndex":899264},{"Node":"ms13.ev1","Address":"10.3.161.158","ServiceID":"mesos-consul:10.3.161.158:logstash-ypec:31973","ServiceName":"logstash-ypec","ServiceTags":[],"ServiceAddress":"10.3.161.158","ServicePort":31973,"ServiceEnableTagOverride":false,"CreateIndex":899265,"ModifyIndex":899265},{"Node":"ms14.ev1","Address":"10.3.161.159","ServiceID":"mesos-consul:10.3.161.159:logstash-ypec:31128","ServiceName":"logstash-ypec","ServiceTags":[],"ServiceAddress":"10.3.161.159","ServicePort":31128,"ServiceEnableTagOverride":false,"CreateIndex":899281,"ModifyIndex":899281},{"Node":"ms14.ev1","Address":"10.3.161.159","ServiceID":"mesos-consul:10.3.161.159:logstash-ypec:31129","ServiceName":"logstash-ypec","ServiceTags":[],"ServiceAddress":"10.3.161.159","ServicePort":31129,"ServiceEnableTagOverride":false,"CreateIndex":899282,"ModifyIndex":899282},{"Node":"ms16.ev1","Address":"10.3.161.51","ServiceID":"mesos-consul:10.3.161.51:logstash-ypec:31240","ServiceName":"logstash-ypec","ServiceTags":[],"ServiceAddress":"10.3.161.51","ServicePort":31240,"ServiceEnableTagOverride":false,"CreateIndex":899262,"ModifyIndex":899262},{"Node":"ms16.ev1","Address":"10.3.161.51","ServiceID":"mesos-consul:10.3.161.51:logstash-ypec:31241","ServiceName":"logstash-ypec","ServiceTags":[],"ServiceAddress":"10.3.161.51","ServicePort":31241,"ServiceEnableTagOverride":false,"CreateIndex":899263,"ModifyIndex":899263},{"Node":"ms18.ev1","Address":"10.3.161.160","ServiceID":"mesos-consul:10.3.161.160:logstash-ypec:31073","ServiceName":"logstash-ypec","ServiceTags":[],"ServiceAddress":"10.3.161.160","ServicePort":31073,"ServiceEnableTagOverride":false,"CreateIndex":797052,"ModifyIndex":797070},{"Node":"ms8.ev1","Address":"10.3.161.156","ServiceID":"mesos-consul:10.3.161.156:logstash-ypec:31610","ServiceName":"logstash-ypec","ServiceTags":[],"ServiceAddress":"10.3.161.156","ServicePort":31610,"ServiceEnableTagOverride":false,"CreateIndex":891777,"ModifyIndex":891777},{"Node":"ms8.ev1","Address":"10.3.161.156","ServiceID":"mesos-consul:10.3.161.156:logstash-ypec:31611","ServiceName":"logstash-ypec","ServiceTags":[],"ServiceAddress":"10.3.161.156","ServicePort":31611,"ServiceEnableTagOverride":false,"CreateIndex":891778,"ModifyIndex":891778},{"Node":"ms9.ev1","Address":"10.3.161.157","ServiceID":"mesos-consul:10.3.161.157:logstash-ypec:31047","ServiceName":"logstash-ypec","ServiceTags":[],"ServiceAddress":"10.3.161.157","ServicePort":31047,"ServiceEnableTagOverride":false,"CreateIndex":893948,"ModifyIndex":893948},{"Node":"ms9.ev1","Address":"10.3.161.157","ServiceID":"mesos-consul:10.3.161.157:logstash-ypec:31048","ServiceName":"logstash-ypec","ServiceTags":[],"ServiceAddress":"10.3.161.157","ServicePort":31048,"ServiceEnableTagOverride":false,"CreateIndex":893949,"ModifyIndex":893949}]

Is logstash-ypec a multi-port container? That would explain the double entries. mesos-consul registers each port.

Are you seeing any deregistration errors in mesos-consul's stdout about any of these services?

The oddball one here looks to be 10.3.161.160 (it's the only one with one entry).

logstash-ypec maps only a single port. This is what docker ps said:
788bae458e1a logstash:latest "/bin/sh -c /opt/run." 4 days ago Up 4 days 0.0.0.0:31067->5044/tcp mesos-0328731c-66f0-4582-bad6-b5b44cf94eeb-S822.a2a60311-1ad8-4b96-8d44-ae6271217e0c

I am seeing deregistration messages for other services. Actually none of the services in consul are in sync with marathon. Here is the mesos-consul STDOUT:

warning msg="Unable to register mesos-consul:mesos:05da0a57-8351-4b6f-9318-9e499a4085c5-S187:ms11.ev1: Put http://10.3.161.41:8500/v1/agent/service/register: dial tcp 10.3.161.41:8500: connection refused"
time="2016-04-13T21:54:40Z" level=warning msg="Unable to register mesos-consul:10.3.161.41:local-docker-registry-ypec:31961: Put http://10.3.161.41:8500/v1/agent/service/register: dial tcp 10.3.161.41:8500: connection refused"
time="2016-04-13T21:54:40Z" level=warning msg="Unable to register mesos-consul:10.3.161.41:log-syslog-ypec:31005: Put http://10.3.161.41:8500/v1/agent/service/register: dial tcp 10.3.161.41:8500: connection refused"
time="2016-04-13T21:54:40Z" level=warning msg="Unable to register mesos-consul:10.3.161.41:logstash-forwarder-ypec:31230: Put http://10.3.161.41:8500/v1/agent/service/register: dial tcp 10.3.161.41:8500: connection refused"
time="2016-04-13T21:54:43Z" level=error msg="Put http://10.3.161.145:8500/v1/agent/service/deregister/mesos-consul:mesos:05da0a57-8351-4b6f-9318-9e499a4085c5-S309:ms100.ev1: dial tcp 10.3.161.145:8500: no route to host"

Those are Unable to register messages. It looks like there isn't a Consul agent running on 10.3.161.41 or 10.3.161.145. Any deregistration errors for the six IPs in the logstash-ypec service?

Can you look for the logstash-ypec tasks in the mesos state (mesos_master:mesos_port/master/state.json)?

Yes, you are right.
10.3.161.41 = no consul agent running (mesos agent up)
10.3.161.145 = down node
There aren't any deregistration errors for the logstash-ypec service.

state.json properly reflects the 5 members that marathon shows running. But I did notice extra entries for "logstash-ypec" in state.json that had a status of "TASK_LOST". The 5 running entries had "TASK_RUNNING" status. Does mesos-consul account for the various states of a task?

mesos-consul looks at the Task.State field and only registers the task if that is TASK_RUNNING. Is that the field that is TASK_LOST, or is TASK_LOST in the Task.Statuses.State field?
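A minimal sketch of that check against state.json (the struct only models the fields discussed in this thread and the master address is hypothetical; this is not the actual mesos-consul code):

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type mesosState struct {
	Frameworks []struct {
		Tasks []struct {
			Name      string `json:"name"`
			State     string `json:"state"`
			SlaveID   string `json:"slave_id"`
			Resources struct {
				Ports string `json:"ports"`
			} `json:"resources"`
		} `json:"tasks"`
	} `json:"frameworks"`
}

func main() {
	resp, err := http.Get("http://mesos-master:5050/master/state.json")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var s mesosState
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		log.Fatal(err)
	}

	// Only tasks whose top-level "state" field is TASK_RUNNING should end up
	// registered in Consul; TASK_LOST entries are skipped.
	for _, f := range s.Frameworks {
		for _, t := range f.Tasks {
			if t.State != "TASK_RUNNING" {
				continue
			}
			fmt.Printf("%s %s ports=%s\n", t.Name, t.SlaveID, t.Resources.Ports)
		}
	}
}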

5 of them have:
{
  "id": "ypec_logstash.396436a4-01c6-11e6-b82e-aeb792cf3896",
  "name": "logstash.ypec",
  "framework_id": "20151119-003641-4053861130-5050-18838-0000",
  "executor_id": "",
  "slave_id": "21acb752-84c7-493f-bc96-733a0a31d4c6-S238",
  "state": "TASK_RUNNING",
  "resources": {
    "cpus": 0.5,
    "disk": 0,
    "mem": 5000,
    "ports": "[31395-31396]"
  },
  "statuses": [
    {
      "state": "TASK_RUNNING"
      ....

2 of them have:
{
  "id": "ypec_logstash.53328999-0060-11e6-a87b-0242cf564af5",
  "name": "logstash.ypec",
  "framework_id": "20151119-003641-4053861130-5050-18838-0000",
  "executor_id": "",
  "slave_id": "05da0a57-8351-4b6f-9318-9e499a4085c5-S210",
  "state": "TASK_LOST",
  "resources": {
    "cpus": 0.5,
    "disk": 0,
    "mem": 5000,
    "ports": "[31240-31241]"
  },
  "statuses": [
    {
      "state": "TASK_RUNNING",
      "timestamp": 1460432496.24535,
      "labels": [
        {
          "key": "Docker.NetworkSettings.IPAddress",
          "value": "172.17.0.4"
        }
      ],
      "container_status": {
        "network_infos": [
          {
            "ip_address": "172.17.0.4",
            "ip_addresses": [
              {
                "ip_address": "172.17.0.4"
              }
            ]
          }
        ]
      }
    },
    {
      "state": "TASK_LOST",
      "timestamp": 1460586153.8695
    }
  ],

  1. The two entries with "state": "TASK_LOST" should not be registered by mesos-consul. https://github.com/CiscoCloud/mesos-consul/blob/master/mesos/mesos.go#L232
  2. This line "ports": "[31240-31241]" indicates that there are two ports, so you're going to see two services in the catalog (see the sketch after this list). At least up to Mesos 0.25.0, a single-port container would be "ports": "[31774-31774]". I hope Mesos didn't change that. That would be annoying.
  3. Do you see "ports" lines "[31972-31973]", "[31128-31129]", "[31240-31241]", "[31610-31611]", and "[31047-31048]"? Those are the ones that were TASK_RUNNING.
  4. Do you see anything in state.json with 31073 in the ports field? That's the oddball one from the service catalog.
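Here is a small illustration of point 2, expanding a Mesos "ports" string into the individual ports that each get their own catalog entry (a standalone helper, not the actual mesos-consul parser):

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expandPorts turns a Mesos resource string like "[31240-31241]" or
// "[31774-31774]" into individual ports. Each port becomes a separate service
// registration, which is why a two-port task shows up twice in the catalog.
func expandPorts(s string) []int {
	var ports []int
	for _, r := range strings.Split(strings.Trim(s, "[]"), ",") {
		bounds := strings.SplitN(strings.TrimSpace(r), "-", 2)
		if len(bounds) != 2 {
			continue
		}
		lo, err1 := strconv.Atoi(bounds[0])
		hi, err2 := strconv.Atoi(bounds[1])
		if err1 != nil || err2 != nil {
			continue
		}
		for p := lo; p <= hi; p++ {
			ports = append(ports, p)
		}
	}
	return ports
}

func main() {
	fmt.Println(expandPorts("[31240-31241]")) // [31240 31241]
	fmt.Println(expandPorts("[31774-31774]")) // [31774]
}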

@imrangit Can you try mesos-consul 0.3.2? 0.3.1 had a deregistration bug where the deregistration process would halt if it failed to deregister a service. I'll bet that's what's happening here. It's trying to deregister a service from an agent that is unreachable. It fails and stops deregistering other services.
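For anyone hitting this later, a rough sketch of the behavioral difference (illustrative only; field names and addresses are hypothetical, and the real fix lives in the mesos-consul repo):

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

// staleService pairs a service ID with the address of the agent that owns it.
type staleService struct {
	AgentAddr string
	ServiceID string
}

// deregisterAll keeps going when a single agent is unreachable, which is the
// 0.3.2 behavior. The pre-0.3.2 bug was the equivalent of returning on the
// first error, leaving every later service registered.
func deregisterAll(stale []staleService) {
	for _, s := range stale {
		cfg := api.DefaultConfig()
		cfg.Address = fmt.Sprintf("%s:8500", s.AgentAddr)
		client, err := api.NewClient(cfg)
		if err != nil {
			log.Printf("deregister %s: %v", s.ServiceID, err)
			continue
		}
		if err := client.Agent().ServiceDeregister(s.ServiceID); err != nil {
			log.Printf("deregister %s: %v", s.ServiceID, err)
			continue // don't let one dead node block the rest
		}
	}
}

func main() {
	deregisterAll([]staleService{
		{AgentAddr: "10.3.161.145", ServiceID: "mesos-consul:10.3.161.145:myapp:31000"},
	})
}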

@ChrisAubuchon: should I also upgrade to the new consul 0.6.3?

-Imran

@imrangit consul 0.6.0 is fine

@ChrisAubuchon: does the master branch have the 0.3.2 changes committed? main.go still shows version 0.3.1.

@imrangit It does. I don't know why the version is still 0.3.1. My local branch has it as 0.3.2 in main.go, and git says that the repo doesn't need updating.

maybe git push origin master?