caio/foca

Some inconsistent behavior in v0.17.0 compared to v0.16.0


The cluster has two nodes, called node1 and node2.
When restarting node2 very quickly:

  1. node1 only receives Notification::Rename. However, there should be both Notification::Rename and Notification::MemberUp, as described in the doc comments.
  2. node2 sometimes gets a DataFromOurselves error, intermittently. I looked into the Input::Data that causes the error and decoded it into payload::Header; the approximate structure is Header { src: node2, dst: node2, message: IndirectPing { origin: node1 } }

Thank you for the report!

  1. node1 only receives Notification::Rename. However, there should be both Notification::Rename and Notification::MemberUp, as described in the doc comments

This depends on whether node1 learns about node2 going down before it learns about the (new) node2 restarting.

I suspect that you're simply interrupting the process on node2 and starting it up again, right? That means it will take a little while for the cluster (just node1 in this case) to find out that the member is gone (that's what Config::probe_period governs: how frequently one member pings a random other member to figure out if it's still up).
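For illustration, a minimal sketch of tuning this knob (Config::simple() and the exact field set are assumptions based on v0.17.x; check the crate docs):

```rust
use std::time::Duration;

use foca::Config;

// Start from a baseline configuration and shorten the probe cycle so
// a dead member is detected faster, at the cost of more chatter.
let mut config = Config::simple();
config.probe_period = Duration::from_millis(500); // ping a random member twice a second
```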

If instead of just killing the process you teach it to do a clean exit via Foca::leave_cluster, then the cluster will learn that the member left very quickly and you should get the behaviour you're expecting.
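A minimal sketch of what that could look like, assuming a tokio-based application where `foca` is the foca::Foca instance and `runtime` is the application's foca::Runtime implementation (both assumed to exist elsewhere):

```rust
// Hypothetical graceful-shutdown path: trap ctrl-c and announce the
// departure instead of letting the other members' probes time out.
tokio::signal::ctrl_c().await?;
foca.leave_cluster(&mut runtime)?;
// ...flush whatever packets `runtime` accumulated, then exit.
```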

Reasoning about it can be a bit confusing because we're talking about shared state: we know we just unplugged a member and it's down, but the rest of the cluster has no idea of what has happened yet. Any suggestions on how to improve the docs here?

2. node2 sometimes gets a DataFromOurselves error, intermittently. I looked into the Input::Data that causes the error and decoded it into payload::Header; the approximate structure is Header { src: node2, dst: node2, message: IndirectPing { origin: node1 } }

Aha, that's a bug 😁

From your description I suspect the bug is that Foca is not verifying that the probed member is still active halfway through the probe, so:

  1. node1 chose to probe node2(old)
  2. node2(old) became node2(new)
  3. node1 learned about node2(new) while still probing node2(old)
  4. node1 asked node2(new) to ping node2(old) on its behalf, since node2(old) didn't respond to the direct ping

I'll get to fixing it properly later today and will update the issue accordingly; it's a simple fix. The good news is that it's a harmless bug (the probe would have failed anyway since node2(old) is not active anymore).
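For the curious, a rough sketch of the kind of guard such a fix implies (hypothetical code, not the actual patch; `members.is_active`, `send_indirect_pings`, and `probed_member` are made-up names):

```rust
// Before escalating from a failed direct ping to indirect probes,
// re-check that the member chosen at the start of the probe cycle is
// still active. If it was renamed or declared down mid-probe, abandon
// the probe instead of asking its replacement to ping it.
if !self.members.is_active(&probed_member) {
    return Ok(()); // probe target no longer exists; nothing to do
}
self.send_indirect_pings(&probed_member)?;
```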

This bug, however, should be present in every foca version; the probe logic hasn't been touched in a long time.

If you think there's something more complex going on, it'd be helpful to know precisely what code you're running and how, so please share it if you can. At the very least, I'd need to see the Identity implementation.
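For reference, a sketch of a typical Identity implementation for a setup like this (trait shape recalled from v0.17.x; treat the details as assumptions and verify against the docs):

```rust
use std::net::SocketAddr;

use foca::Identity;

// The address stays stable across restarts while `bump` distinguishes
// each incarnation of the process; generating a fresh `bump` on every
// startup is what makes the other nodes emit Notification::Rename.
#[derive(Clone, Debug, PartialEq, Eq)]
struct Id {
    addr: SocketAddr,
    bump: u64, // e.g. startup timestamp or a random value
}

impl Identity for Id {
    type Addr = SocketAddr;

    fn addr(&self) -> SocketAddr {
        self.addr
    }

    // Produce a distinct identity with the same address so foca can
    // rejoin the cluster after being declared down.
    fn renew(&self) -> Option<Self> {
        Some(Self {
            addr: self.addr,
            bump: self.bump.wrapping_add(1),
        })
    }

    // Given two identities with the same address, prefer the most
    // recent incarnation.
    fn win_addr_conflict(&self, adversary: &Self) -> bool {
        self.bump > adversary.bump
    }
}
```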

Thanks for your reply!

I suspect that you're simply interrupting the process on node2 and starting it up again, right?

Yes, I terminate the process via ctrl-c. In older foca versions there would be a Notification::MemberUp followed by a Notification::MemberDown when restarting quickly. In my view, a node restarting can be considered a MemberUp event, so a Rename followed by a MemberUp would be better? But that's okay; Rename is enough for me to handle it in my own code logic.

If instead of just killing the process you teach it to do a clean exit via Foca::leave_cluster, then the cluster will learn that the member left very quickly and you should get the behaviour you're expecting.

Unfortunately, an unexpected machine crash or power loss is a similar situation, one in which Foca::leave_cluster can't be called. I need to handle such situations in my code anyway, so I don't do a clean exit via Foca::leave_cluster even when the process exits manually.

This bug, however, should be present in every foca version; the probe logic hasn't been touched in a long time.

I suspect this new check, header.src.addr() == self.identity.addr(), leads to DataFromOurselves. In older foca versions this condition would not trigger.

In my view, a node restarting can be considered a MemberUp event, so a Rename followed by a MemberUp would be better? But that's okay; Rename is enough for me to handle it in my own code logic.

Notice that notifications are about how the current foca instance is interpreting the stream of events/updates in the cluster; there's no guarantee that the events arrive in order, so one node can experience just a Rename notification while another gets Rename and MemberUp (or MemberDown).

The only information the Rename notification gives is that foca found two identities with the same address and is now preferring the newer one. In your case you get the Rename notification because you create a new identity on every startup; if instead you saved the id to disk and reused it, you would get no notification.
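If the application only cares about which addresses are currently reachable, one option (a sketch; variant names as of v0.17.x, and `peers` is a hypothetical set of live addresses) is to treat Rename as an Up for the new identity:

```rust
use foca::{Identity, Notification};

// `peers` is assumed to be a HashSet<SocketAddr> of live members and
// `Id` an Identity whose Addr is SocketAddr (as sketched above).
match notification {
    Notification::MemberUp(id) => {
        peers.insert(id.addr());
    }
    Notification::MemberDown(id) => {
        peers.remove(&id.addr());
    }
    // The old identity was replaced by a newer one at the same
    // address: the member is still (or again) alive.
    Notification::Rename(_old, new) => {
        peers.insert(new.addr());
    }
    // The remaining variants describe this instance, not its peers.
    _ => {}
}
```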

I suspect this new check, header.src.addr() == self.identity.addr(), leads to DataFromOurselves. In older foca versions this condition would not trigger.

Thanks! That's the sanity check; it's there precisely to catch a node sending a message to itself, which should never happen. What caused it to trigger is the probe sequence I described in the previous message.
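Paraphrased, the guard in question looks roughly like this (a sketch, not the exact crate source):

```rust
// Reject incoming data that claims to originate from our own address;
// a healthy cluster should never produce such a packet.
if header.src.addr() == self.identity.addr() {
    return Err(Error::DataFromOurselves);
}
```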

I've implemented a fix for it in bb9ccbb; I'll release it later today or early tomorrow as v0.17.1.

I've released v0.17.1 with a fix for this bug (and another one I caught as I stressed the codepath a bit while testing this one).

Thank you very much for the report! Feel free to reopen and/or ask questions if the problem persists on your side.