Another approach to mitigate split-brain of a long-lived connection
tedkimdev opened this issue · 2 comments
Description:
Split-brain of a long-lived connection occurs when a long-lived connection such as WatchDocument stays connected to the old server, instead of being rerouted to the new one, after a server is added to or removed from the cluster.
Current Solution:
Sharded Cluster Mode Risks and Mitigation.
Possible Solution: Use a Centralized Event Broker
By handling the Server Up Event and Server Down Event, send a reconnect message to the clients or disconnect the DocumentWatchedEvent stream. These events would include relevant information such as the event name, server identifier, and timestamp.
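As a rough illustration of what such an event could carry, here is a minimal Go sketch. The type name `ClusterEvent`, the constants, the JSON field names, and the helper functions are all hypothetical and not part of Yorkie; only the three fields (event name, server identifier, timestamp) come from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// EventName distinguishes the two cluster events (hypothetical names).
type EventName string

const (
	ServerUp   EventName = "ServerUp"
	ServerDown EventName = "ServerDown"
)

// ClusterEvent carries the event name, server identifier, and timestamp.
type ClusterEvent struct {
	Name      EventName `json:"name"`
	ServerID  string    `json:"server_id"`
	Timestamp time.Time `json:"timestamp"`
}

// NewEvent stamps an event with the current UTC time.
func NewEvent(name EventName, serverID string) ClusterEvent {
	return ClusterEvent{Name: name, ServerID: serverID, Timestamp: time.Now().UTC()}
}

// Encode serializes an event so it can be published to a broker.
func Encode(e ClusterEvent) ([]byte, error) {
	return json.Marshal(e)
}

// Decode parses a payload received from a broker.
func Decode(b []byte) (ClusterEvent, error) {
	var e ClusterEvent
	err := json.Unmarshal(b, &e)
	return e, err
}

func main() {
	payload, err := Encode(NewEvent(ServerDown, "server-a"))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload))
}
```

JSON is used here only because it is easy to inspect; the actual wire format would depend on the broker chosen.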
Illustrative Process:
Consider the following steps as an example:
- A cluster consists of two servers: Server A and Server B.
- When Server A goes offline, Server A publishes a Server Down Event to Pub/Sub.
- Server B detects the event from Pub/Sub and starts to store clients' information on the new stream connection until another server is available (the successor of Server A).
- When Server C is up, Server C publishes a Server Up Event to Pub/Sub.
- Server B receives the Server Up Event from Pub/Sub, finds the stream connections established between the time of Server A's downtime and Server C's uptime, and disconnects the streams or sends reconnect messages to the clients.
- The Watch stream in a split-brain state on Server B will be reconnected to Server C.
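The steps above, seen from Server B, can be sketched roughly as follows. `GapTracker` and its methods are hypothetical names, and the sketch assumes only one server is missing at a time and that the Pub/Sub subscriber calls `OnServerDown` and `OnServerUp` when the corresponding events arrive. It only decides which streams to act on; disconnecting them or sending reconnect messages is left out.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// GapTracker runs on a surviving server and remembers which streams were
// opened while a peer was missing from the cluster.
type GapTracker struct {
	mu      sync.Mutex
	inGap   bool                // true between a Server Down Event and the next Server Up Event
	streams map[string]struct{} // IDs of streams opened while inGap was true
}

func NewGapTracker() *GapTracker {
	return &GapTracker{streams: make(map[string]struct{})}
}

// OnServerDown starts recording newly opened streams.
func (t *GapTracker) OnServerDown() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.inGap = true
}

// OnStreamOpened records the stream only if a peer is currently missing.
func (t *GapTracker) OnStreamOpened(streamID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.inGap {
		t.streams[streamID] = struct{}{}
	}
}

// OnStreamClosed forgets a stream that ended on its own.
func (t *GapTracker) OnStreamClosed(streamID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.streams, streamID)
}

// OnServerUp ends the gap and returns, in sorted order, the streams that
// should be disconnected or asked to reconnect.
func (t *GapTracker) OnServerUp() []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	ids := make([]string, 0, len(t.streams))
	for id := range t.streams {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	t.inGap = false
	t.streams = make(map[string]struct{})
	return ids
}

func main() {
	t := NewGapTracker()
	t.OnStreamOpened("stream-1") // opened before the outage, so not tracked
	t.OnServerDown()             // Server A goes offline
	t.OnStreamOpened("stream-2")
	t.OnStreamOpened("stream-3")
	t.OnStreamClosed("stream-3") // client left before Server C came up
	fmt.Println(t.OnServerUp())  // prints [stream-2]
}
```

A real implementation would probably also check whether each recorded stream actually maps to the new server before touching it, since some streams opened during the gap may legitimately belong to Server B.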
Why:
It reduces unnecessary disconnections of long-lived connections while still lowering the risk of split-brain.
I think your solution could dramatically reduce the overhead of re-establishing connections, which we currently incur through the periodic stream timeout used to work around split-brain.
One thing I'm curious about is whether there is a way to store stream connection information somewhere and selectively disconnect a stream connection when we want to.
So I think it would be good to build a PoC and see whether we can use your solution.
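One possible answer to the question above, as a sketch only: keep a per-server registry that maps a stream ID to the cancel function of that stream's context, so cancelling the context ends the handler and with it the stream. `StreamRegistry` and its methods are hypothetical, and the sketch assumes the watch handler exits when its context is done.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// StreamRegistry maps a stream ID to the cancel function of its context.
type StreamRegistry struct {
	mu      sync.Mutex
	cancels map[string]context.CancelFunc
}

func NewStreamRegistry() *StreamRegistry {
	return &StreamRegistry{cancels: make(map[string]context.CancelFunc)}
}

// Register derives a cancellable context for a new stream and remembers its
// cancel function. A previous registration under the same ID is cancelled.
func (r *StreamRegistry) Register(parent context.Context, streamID string) context.Context {
	ctx, cancel := context.WithCancel(parent)
	r.mu.Lock()
	old, exists := r.cancels[streamID]
	r.cancels[streamID] = cancel
	r.mu.Unlock()
	if exists {
		old()
	}
	return ctx
}

// Disconnect cancels one stream and reports whether it was registered.
func (r *StreamRegistry) Disconnect(streamID string) bool {
	r.mu.Lock()
	cancel, ok := r.cancels[streamID]
	delete(r.cancels, streamID)
	r.mu.Unlock()
	if ok {
		cancel()
	}
	return ok
}

// demo registers one stream, disconnects it, and returns the context error
// observed by the stand-in handler.
func demo() error {
	reg := NewStreamRegistry()
	ctx := reg.Register(context.Background(), "stream-2")
	done := make(chan struct{})
	go func() {
		<-ctx.Done() // stand-in for a watch handler waiting on its stream
		close(done)
	}()
	reg.Disconnect("stream-2")
	<-done
	return ctx.Err()
}

func main() {
	fmt.Println("stream-2 closed:", demo())
}
```

The handler would also need to call `Disconnect` (or an equivalent unregister method) when a stream ends normally, otherwise the map grows without bound.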
Yorkie had an etcd implementation of the sync package, which provided synchronization between Yorkie servers by using pub/sub. This will help you build the pub/sub component with Yorkie servers. See older commits from this PR: #504.
Updates: We are currently discussing another approach to mitigate this issue.
The idea is to let the client detect the RPC session ID (server ID) and proactively re-establish the watch stream when that ID changes.
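A minimal client-side sketch of that idea follows. `SessionWatcher` is a hypothetical helper, and how the server ID reaches the client (for example in response metadata) is an assumption that this thread does not settle. The sketch is not safe for concurrent use.

```go
package main

import "fmt"

// SessionWatcher remembers the server ID seen on the last RPC response and
// triggers a reconnect when a different ID shows up.
type SessionWatcher struct {
	lastServerID string
	reconnect    func() // re-establishes the watch stream
}

func NewSessionWatcher(reconnect func()) *SessionWatcher {
	return &SessionWatcher{reconnect: reconnect}
}

// Observe is called with the server ID attached to each RPC response.
// It reports whether the ID changed and, if so, invokes reconnect.
func (w *SessionWatcher) Observe(serverID string) bool {
	changed := w.lastServerID != "" && w.lastServerID != serverID
	w.lastServerID = serverID
	if changed && w.reconnect != nil {
		w.reconnect()
	}
	return changed
}

func main() {
	w := NewSessionWatcher(func() {
		fmt.Println("re-establishing watch stream")
	})
	w.Observe("server-a") // first response: nothing to compare against
	w.Observe("server-a") // same server: no action
	w.Observe("server-c") // routing changed: reconnect is triggered
}
```

Compared with the broker approach, this keeps the servers stateless about streams, at the cost of detecting the change only when the client makes another RPC.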