mrkamel/heartbeat

Switching doesn't work when ping_ip is the same as failover_ip

Closed this issue · 6 comments

Hi,

If I set ping_ip to the same value as failover_ip, the system doesn't switch between servers.

My config:

base_url: https://robot-ws.your-server.de
basic_auth:
  username: [cutted]
  password: [cutted]
failover_ip: 1.2.3.4
ping_ip: 1.2.3.4
ips:
  - ping: 2.3.4.5
    target: 2.3.4.5
  - ping: 3.4.5.6
    target: 3.4.5.6
interval: 30
timeout: 10
tries: 3

In this case, the log looks like this:

I, [2013-11-11T13:59:32.024087 #10269]  INFO -- : 1.2.3.4 is up.
I, [2013-11-11T14:00:04.242432 #10269]  INFO -- : 1.2.3.4 is down.
I, [2013-11-11T14:00:07.177245 #10269]  INFO -- : Not responsible for 2.3.4.5
I, [2013-11-11T14:05:09.419816 #10269]  INFO -- : 1.2.3.4 is down.
I, [2013-11-11T14:05:09.996530 #10269]  INFO -- : Not responsible for 2.3.4.5.
I, [2013-11-11T14:10:12.214836 #10269]  INFO -- : 1.2.3.4 is down.
I, [2013-11-11T14:10:12.932809 #10269]  INFO -- : Not responsible for 2.3.4.5.
I, [2013-11-11T14:15:15.176340 #10269]  INFO -- : 1.2.3.4 is down.
I, [2013-11-11T14:15:15.774078 #10269]  INFO -- : Not responsible for 2.3.4.5.
I, [2013-11-11T14:20:17.998591 #10269]  INFO -- : 1.2.3.4 is down.
I, [2013-11-11T14:20:23.707318 #10269]  INFO -- : Not responsible for 2.3.4.5.
I, [2013-11-11T14:25:25.928112 #10269]  INFO -- : 1.2.3.4 is down.
I, [2013-11-11T14:25:26.511541 #10269]  INFO -- : Not responsible for 2.3.4.5.
I, [2013-11-11T14:30:28.739342 #10269]  INFO -- : 1.2.3.4 is down.
I, [2013-11-11T14:30:31.446149 #10269]  INFO -- : Not responsible for 2.3.4.5.

But if I set ping_ip to the first server, switching happens after the first lost packet, without honoring tries etc.:

I, [2013-11-11T15:34:36.178875 #11312]  INFO -- : 2.3.4.5 is up.
I, [2013-11-11T15:34:39.381142 #11312]  INFO -- : 2.3.4.5 is down.
I, [2013-11-11T15:34:42.955294 #11312]  INFO -- : Switching to 3.4.5.6.

System: ubuntu server 13.10

Best regards

hm, you have to add 1.2.3.4 to your ips: section.

in your scenario, 1.2.3.4 goes down, and heartbeat looks up the current target for your failover ip, which currently is 1.2.3.4 (right?)

heartbeat then searches for 1.2.3.4 in the ips: section to
a) ensure heartbeat is responsible and
b) find an entry point to search for the next available ip and
c) ensure that the failover ip's target currently really points to the ip you monitor using the ping_ip option

As heartbeat does not find it within your ips: section, heartbeat reports that it is not responsible.
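
The lookup described above can be sketched roughly as follows. This is only an illustration, not heartbeat's actual code: the helper name `next_candidates` is hypothetical, and matching on the `ping` field of each `ips:` entry is an assumption.

```ruby
# Hypothetical helper, not taken from heartbeat's source.
# ips mirrors the ips: section: a list of { "ping" => ..., "target" => ... } hashes.
def next_candidates(ips, ping_ip)
  # Assumption: the monitored ip is looked up via the "ping" field.
  index = ips.index { |entry| entry["ping"] == ping_ip }

  # Not found in ips: this corresponds to the "Not responsible" case.
  return nil if index.nil?

  # Start right after the monitored entry, wrap around, and drop the
  # monitored entry itself (it ends up last after the rotation).
  ips.rotate(index + 1)[0...-1]
end

ips = [
  { "ping" => "2.3.4.5", "target" => "2.3.4.5" },
  { "ping" => "3.4.5.6", "target" => "3.4.5.6" },
]
```

With the config from the report, `next_candidates(ips, "1.2.3.4")` returns `nil` because 1.2.3.4 is not listed, while `next_candidates(ips, "2.3.4.5")` yields the 3.4.5.6 entry as the only candidate.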

If heartbeat blindly switched the failover ip whenever ping_ip is down, heartbeat could possibly switch ips even though the server behind the failover ip is currently up.

sry, i misunderstood, looks like a bug.
will look into it.
thx for reporting.

Heartbeat now assumes it is responsible if ping_ip can be found within the ips: section OR if ping_ip is equal to the failover_ip.
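
As a minimal sketch of that rule (the method name `responsible?` and the hash-based config are assumptions for illustration, not heartbeat's real implementation):

```ruby
# Hypothetical predicate, not taken from heartbeat's source.
# config is assumed to be the parsed YAML config as a Hash with string keys.
def responsible?(config)
  ping_ip = config["ping_ip"]

  # Case 1: ping_ip appears as a "ping" address in the ips: section.
  listed = config["ips"].any? { |entry| entry["ping"] == ping_ip }

  # Case 2: ping_ip is the failover ip itself.
  listed || ping_ip == config["failover_ip"]
end

config = {
  "failover_ip" => "1.2.3.4",
  "ping_ip"     => "1.2.3.4",
  "ips" => [
    { "ping" => "2.3.4.5", "target" => "2.3.4.5" },
    { "ping" => "3.4.5.6", "target" => "3.4.5.6" },
  ],
}
```

For the config from the report, `responsible?(config)` is true only because of the second case, since 1.2.3.4 is not listed under ips:.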

Hi,

Thanks for the reply.
Now it works the same way I mentioned before: it switches to the second server without honoring the 3 tries and timeouts, right after 1 packet is lost, which is not very convenient:

I, [2013-11-13T11:38:59.379911 #30802]  INFO -- : 1.2.3.4 is up.
I, [2013-11-13T11:39:31.632584 #30802]  INFO -- : 1.2.3.4 is up.
I, [2013-11-13T11:40:13.846543 #30802]  INFO -- : 1.2.3.4 is down.
I, [2013-11-13T11:40:17.666425 #30802]  INFO -- : Switching to 3.4.5.6.

Hi,

it should not actually be a single lost packet. Every ping test is run like this:

`ping -W #{timeout} -c #{tries} #{ip}`

If the log says 1.2.3.4 is down, the configured number of tries and the timeout have already been applied.
When used with -W and -c, ping exits with 0 (success) if it receives at least 1 reply.
If we used -w (deadline) together with -c, it would exit with 1 whenever fewer than count replies arrive before the deadline, but that is not what we do.
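
To make the exit-status behaviour concrete, here is a rough sketch of such a check. The helper names `ping_command` and `up?` are hypothetical, and the defaults are taken from the config in the report; the only thing borrowed from heartbeat is the `ping -W ... -c ...` invocation quoted above.

```ruby
# Hypothetical helpers, not taken from heartbeat's source.

# Builds the argument list for: ping -W <timeout> -c <tries> <ip>
def ping_command(ip, timeout: 10, tries: 3)
  ["ping", "-W", timeout.to_s, "-c", tries.to_s, ip]
end

# Treats the ip as up only if ping exits with status 0, i.e. at least
# one of the <tries> packets was answered within the per-reply timeout.
def up?(ip, timeout: 10, tries: 3)
  # Kernel#system returns true on exit status 0, false on a non-zero
  # status, and nil if the command could not be executed at all.
  system(*ping_command(ip, timeout: timeout, tries: tries),
         out: File::NULL, err: File::NULL) == true
end
```

Under these assumptions, a single "is down" log line already reflects all `tries` packets going unanswered, not one lost packet.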

Okay, thanks a lot again!