tinode/chat

[Ask About Cluster Design]

riandyrn opened this issue · 17 comments

Hello, Gene

I have a question regarding the current cluster design. I found that a user's topic is always stuck to a certain cluster node. Why?

In the event of a node failure, this means that users whose topics are stuck to that node won't be able to access their data or be contacted by others, right?

Hi Riandy,

Yes, that's indeed the case. I have not implemented failover yet, just clustering. Failover is straightforward with one caveat: when the server fails it's usually restarted. So the sessions would be migrated twice.

Is failover important for your use case? If so I can look into it.

Failover is straightforward with one caveat: when the server fails it's usually restarted. So the sessions would be migrated twice.

Hmm…, I don't quite understand this line, Gene. Would you please elaborate on the idea of failover in Tinode?

Is failover important for your use case?

Yes, we're trying to use Tinode to handle a high number of connections. There will be times when we need to take a server down, and it could take several minutes to bring it back up. Since Tinode uses consistent hashing, while this happens a lot of users whose topics are bound to this server won't be able to access them. So a failover mechanism is important in our case.
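
Just to make sure we mean the same thing by consistent hashing, here is a rough sketch of the topic-to-node mapping as I understand it (the ring/nodeFor names are mine, not Tinode's actual code):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring maps hash points on a circle to node names.
type ring struct {
	points []uint32
	owners map[uint32]string
}

func hash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// newRing places each node on the ring several times (virtual nodes)
// so the load spreads more evenly.
func newRing(nodes []string, vnodes int) *ring {
	r := &ring{owners: map[uint32]string{}}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			p := hash(fmt.Sprintf("%s-%d", n, i))
			r.points = append(r.points, p)
			r.owners[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// nodeFor returns the node that owns the given topic: the first ring
// point at or after the topic's hash, wrapping around at the end.
func (r *ring) nodeFor(topic string) string {
	h := hash(topic)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owners[r.points[i]]
}

func main() {
	r := newRing([]string{"node1", "node2", "node3"}, 64)
	for _, t := range []string{"usrAlice", "usrBob", "grpGeneral"} {
		fmt.Println(t, "->", r.nodeFor(t))
	}
}
```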

Suppose there are three servers A, B, C, three user sessions U1, U2, U3, six topics T1, T2.. T6:

U1 is connected to A, U2 to B, U3 to C. A handles topics T1 and T2, B - T3 and T4, C - T5, T6.
Session U1 is attached to topics T1 and T3, U2 to T2 and T4, U3 to T3 and T6. Like this:

U1(T1, T3) -> A(T1, T2) ... T3 -> B
U2(T2, T4) -> B(T3, T4) ... T2 -> A
U3(T3, T6) -> C(T5, T6) ... T3 -> B

The dotted line ... means in-cluster topic forwarding.
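
In code, the forwarding decision is roughly the following. This is a simplified sketch with the assignment from the diagram hard-coded, not the actual server code:

```go
package main

import "fmt"

// nodeFor is a stand-in for the hash-ring lookup: topic name -> owning node.
// The table mirrors the example above.
func nodeFor(topic string) string {
	owners := map[string]string{
		"T1": "A", "T2": "A",
		"T3": "B", "T4": "B",
		"T5": "C", "T6": "C",
	}
	return owners[topic]
}

// route decides whether a request for a topic is handled locally or
// forwarded over the cluster connection (the dotted "..." line above).
func route(localNode, topic string) {
	if owner := nodeFor(topic); owner == localNode {
		fmt.Printf("%s: topic %s handled locally\n", localNode, topic)
	} else {
		fmt.Printf("%s: topic %s forwarded to %s\n", localNode, topic, owner)
	}
}

func main() {
	// U1 is connected to A and attached to T1 and T3.
	route("A", "T1") // handled locally
	route("A", "T3") // forwarded to B
}
```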

Suppose server B goes down.

  1. Without failover (how it works now)
U1(T1, T3) -> A(T1, T2) ... T3 -X-> B
U2(T2, T4) -X-> B(T3, T4)
U3(T3, T6) -> C(T5, T6) ... T3 -X-> B

(a) U2's connection is lost. U2 must reconnect to another server. Tinode does not provide a facility for that; you need something in front of the cluster, like HAProxy or nginx.
(b) Topics T3 and T4 become unavailable for the duration of the outage.
(c) When server B comes back online, U1 and U2 must reattach to topic T3 - they have to send a new {sub} request.

  2. With failover
U1(T1, T3) -> A(T1, T2, T4) ... T3 -> C
U2(T2, T4) -X-> B()
U3(T3, T6) -> C(T5, T6, T3)

(a) U2 connection is lost, the same as 1(a).
(b) T3 and T4 migrate to other servers. This causes a disruption: U1 and U3 must re-attach to T3 by sending {sub}.
(c) When server B comes back online, topics T3 and T4 migrate back to B, causing a second disruption: U1 and U3 must re-attach to T3 by sending {sub} again.

Technically it's possible to avoid the disruption in 2(b) and 2(c) but it's a bit tricky.
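
A stand-in sketch of what failover does to the topic assignment (ownersFor is a placeholder for the real hash-ring lookup, not the actual implementation):

```go
package main

import "fmt"

// ownersFor stands in for the consistent-hash assignment: given the set
// of live nodes, return which node owns each topic. Modulo is used here
// only to keep the sketch short; the real code uses a hash ring.
func ownersFor(nodes []string, topics []string) map[string]string {
	out := map[string]string{}
	for i, t := range topics {
		out[t] = nodes[i%len(nodes)]
	}
	return out
}

func main() {
	topics := []string{"T1", "T2", "T3", "T4", "T5", "T6"}

	before := ownersFor([]string{"A", "B", "C"}, topics)
	// B fails: the assignment is rebuilt without it. Every topic B owned
	// re-homes to a surviving node, and sessions attached to those topics
	// have to re-attach with a new {sub} - the disruption in 2(b).
	during := ownersFor([]string{"A", "C"}, topics)
	// B comes back: the assignment is rebuilt once more and the same
	// topics migrate back - the second disruption described in 2(c).
	after := ownersFor([]string{"A", "B", "C"}, topics)

	fmt.Println("before outage:", before)
	fmt.Println("during outage:", during)
	fmt.Println("after restart:", after)
}
```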

I understand that you want to use it in production. My question was more around how important it is to avoid any disruption. Suppose you have 5 servers and restart one of them. 20% of your users will have to be reconnected to other servers anyway, with or without the failover. Without failover, 20% of topics will become unavailable for the duration of the outage. If the server is down for a few minutes it may be an acceptable disruption.
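
To expand on 1(a) and 1(c): reconnecting and re-attaching is the client's job. A rough sketch of a client-side reconnect loop, assuming a websocket connection and a simplified {sub} frame (not the actual client SDK; the {hi}/{login} handshake is omitted and the host/path are placeholders):

```go
package main

import (
	"encoding/json"
	"log"
	"time"

	"github.com/gorilla/websocket"
)

func main() {
	attached := []string{"T1", "T3"} // topics this session was attached to

	for {
		// Whatever sits in front of the cluster (HAProxy, nginx, ...) picks a live server.
		conn, _, err := websocket.DefaultDialer.Dial("ws://chat.example.com/v0/channels", nil)
		if err != nil {
			log.Println("dial failed, retrying:", err)
			time.Sleep(2 * time.Second)
			continue
		}

		// After every (re)connect the client must re-attach each topic
		// with a new {sub} request - the server will not do it for you.
		for _, topic := range attached {
			frame, _ := json.Marshal(map[string]interface{}{"sub": map[string]string{"topic": topic}})
			if err := conn.WriteMessage(websocket.TextMessage, frame); err != nil {
				break
			}
		}

		// Read until the connection drops, then loop and reconnect.
		for {
			if _, _, err := conn.ReadMessage(); err != nil {
				break
			}
		}
		conn.Close()
	}
}
```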

Please take a look: #30

Hello, Gene

Sorry for the delay, I had to take care of some other tasks first. I'll take a look at #30 today.

Hello, Gene

I encountered two likely bugs when exploring the failover feature. The first one is an endless rehashing bug, and the second one is a leadership competition bug between the nodes. Both of them result in frozen topic interaction on all nodes.

Here are the steps I used to reproduce the endless rehashing bug:

  1. Start node 1, node 2, node 3
  2. Terminate node 1 & node 2 quickly one after another
  3. Node 3 will print: “initiating election after failed pings: 8”
  4. Wait for a moment (I waited until node 3's terminal was full of the above messages)
  5. Start node 1 & node 2
  6. Node 3 will print endlessly: “cluster: rehashing at request of a leader one []”
  7. All topic interactions are locked on all nodes

You can see how I reproduce the bug here:

Endless Rehashing Bug
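
For reference, my rough mental model of the election (not Tinode's actual code) is that a node that misses too many leader pings calls an election, but it should only win with a majority of votes, which a single surviving node out of three can never get:

```go
package main

import "fmt"

type node struct {
	name  string
	alive bool
}

// tryElection counts votes from live nodes (the candidate always votes for
// itself) and succeeds only with a strict majority of the whole cluster.
func tryElection(candidate string, cluster []node) bool {
	votes := 0
	for _, n := range cluster {
		if n.name == candidate || n.alive {
			votes++
		}
	}
	return votes > len(cluster)/2
}

func main() {
	// Step 2 of the repro: node 1 and node 2 are down, only node 3 remains.
	cluster := []node{{"node1", false}, {"node2", false}, {"node3", true}}
	fmt.Println("node3 wins election:", tryElection("node3", cluster)) // false: 1 vote out of 3
}
```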

To reproduce the leadership competition bug, we simply start nodes 1, 2, and 3 quickly one after another. You can see how I reproduce the bug here:

Leadership Competition Bug

Let me know your opinion.

Thanks

Looking into it. Thanks.

It should work now: 12017b2

Hello, Gene

I found another 2 bugs.

The first one is super tricky. Sometimes it happens, sometimes it doesn't, but a lot of the time it does.

Suppose the following node-topic bindings occur, where none of the users has ever had a p2p connection with another (I used new users to ensure this property):

  • Node 1:
    • User A
  • Node 2:
    • User B
  • Node 3:
    • User C

When Node 1 dies and failover occurs, suppose the result is the following:

  • Node 2:
    • User B
    • User A
  • Node 3:
    • User C

When User C tries to initiate a p2p connection with User A, they get a Permission Denied error, but when User B tries to initiate a p2p connection with User A, it is created successfully. When we add more users to both Node 2 and Node 3, sometimes this happens and sometimes it doesn't, but a user bound to Node 3 is more likely to get the error when trying to initiate a connection with User A. This bug also sometimes occurs without a failover event.

The second bug occurs when I delete a p2p subscription whose topic is bound to another node and then recreate it. For example, User B successfully creates a p2p connection with User C, then User B deletes the subscription, then recreates it. There are three possible outcomes:

  • when it is successfully recreated, the user's own permissions contain only the 'A' permission
  • when it is successfully recreated, the user's own permissions contain the 'N' permission, and so does the user on the other side
  • when recreation fails, a "Permission Denied" error is thrown to the user

Let me know your opinion.

Thanks

Looking into it. Thanks.

I just fixed one bug which probably caused all of the above, specifically line 379:
ed5a448#diff-d412b9240514b30a308e4b4c8208ae72R379
See if you can reproduce it now.

Hello, Gene

Yes, it looks like the latest commit fixes the first bug. So far I'm unable to reproduce the error (y).

The second bug is still there, but it doesn't seem to be related to the cluster code, since I was also able to reproduce it on a standalone server.

Here are the steps I used to reproduce the bug:

  1. log in with any account
  2. subscribe to any other account
  3. delete the subscription
  4. resubscribe
  5. the user's own permission is now only A

You can see how I reproduce the bug here:
Resubscribe Permission Bug
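
For clarity, the same steps expressed as (simplified) client messages. The ids, the {hi} handshake, and real credentials are omitted, and "usrBob" stands for the other account's id, so the exact frames may differ:

```go
package main

import "fmt"

func main() {
	frames := []string{
		// 1. log in with any account (secret is base64 of "login:password")
		`{"login": {"scheme": "basic", "secret": "<base64 login:password>"}}`,
		// 2. subscribe to any other account, i.e. start a p2p topic
		`{"sub": {"topic": "usrBob"}}`,
		// 3. delete the subscription
		`{"leave": {"topic": "usrBob", "unsub": true}}`,
		// 4. resubscribe
		`{"sub": {"topic": "usrBob"}}`,
		// 5. the topic description that comes back now shows own access mode "A" only
	}
	for _, f := range frames {
		fmt.Println(f)
	}
}
```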

I also discovered another bug which can make the server crash when a user deletes their p2p subscription.

Here are the steps to reproduce it:

  1. log in with two different accounts, e.g. userA & userB
  2. initiate a p2p connection between userA & userB, for example from userA
  3. delete userA's subscription
  4. userB subscribes to the p2p topic with userA
  5. the server explodes :D

You can see how I reproduce the bug here:

P2P Resubscribe Bomb Bug

Let me know your opinion.

Thanks

Looking into it. Thanks.

The first bug is fixed with 0b68257

The second one is clear, I can make it go away, but I want to make it right. So it would probably take me a bit of time to refactor some code.

No problem, Gene. I think it’s better for the project (y).

I think it's fixed now: de54966

If you don't mind, please file bugs separately instead of adding to this thread. Thanks!