valkey-io/valkey

[BUG] Clusters can become inconsistent if one side times out the handshake during cluster meets

Closed this issue · 2 comments

The following TCL test will reproduce the issue:

start_cluster 2 0 {tags {external:skip cluster} overrides {cluster-replica-no-failover yes}} {
    test "Partial node timeout with two shards" {
        set cluster_port [find_available_port $::baseport $::portcount]
        start_server [list overrides [list cluster-enabled yes cluster-node-timeout 1 cluster-port $cluster_port]] {
            # Node 1 - MEET -> Node 0
            ## Node 1 and Node 0 will have each other in the handshake state.
            ## Node 0 will respond with a pong message, which will put Node 0 online on Node 1.
            ## Node 0 should 'eventually' respond with a ping message, but we drop it.
            ## Node 0 will then time out its handshake with Node 1.
            R 1 DEBUG DROP-CLUSTER-PACKET-FILTER 0
            R 2 DEBUG DROP-CLUSTER-PACKET-FILTER 0

            R 1 CLUSTER MEET [srv 0 host] [srv 0 port] $cluster_port

            # Note, the delay is very tight since we want to react fast
            # to the node receiving the meet message.
            wait_for_condition 1000 1 {
                [llength [R 0 CLUSTER NODES]] eq 16
            } else {
                fail "Node 0 never learned about node 1"
            }

            # Pause the process to prevent it from getting or sending messages
            pause_process [srv -1 pid]

            # Now wait for node 0 to give up the handshake
            wait_for_condition 50 100 {
                [llength [R 0 CLUSTER NODES]] eq 8
            } else {
                fail "Node 0 never gave up the handshake"
            }

            resume_process [srv -1 pid]
            R 0 DEBUG DROP-CLUSTER-PACKET-FILTER -1
            R 1 DEBUG DROP-CLUSTER-PACKET-FILTER -1
            R 2 DEBUG DROP-CLUSTER-PACKET-FILTER -1

            after 1000
            puts "Nodes after 1 second"
            puts [R 0 CLUSTER NODES]
            puts [R 1 CLUSTER NODES]
            puts [R 2 CLUSTER NODES]

            after 30000
            puts "Nodes after 30 seconds"
            puts [R 0 CLUSTER NODES]
            puts [R 1 CLUSTER NODES]
            puts [R 2 CLUSTER NODES]
        }
    }
}

You'll get something like the following output:

Nodes after 1 second
4534af6491d293645e32f4000f958590a19c61be 127.0.0.1:21614@21613 myself,master - 0 0 0 connected

01fc6f143b05f1723be7b85a7368f16a3410ac52 127.0.0.1:21611@31611 master,nofailover - 0 1730690285488 1 connected 8192-16383
4534af6491d293645e32f4000f958590a19c61be 127.0.0.1:21614@21613 master - 0 1730690285488 0 connected
156735b32f34055bf448bb1c8ecbd6b626b589da 127.0.0.1:21612@31612 myself,master,nofailover - 0 0 2 connected 0-8191

156735b32f34055bf448bb1c8ecbd6b626b589da 127.0.0.1:21612@31612 master,nofailover - 0 1730690285407 2 connected 0-8191
4534af6491d293645e32f4000f958590a19c61be 127.0.0.1:21614@21613 master - 0 1730690285407 0 connected
01fc6f143b05f1723be7b85a7368f16a3410ac52 127.0.0.1:21611@31611 myself,master,nofailover - 0 0 1 connected 8192-16383

Nodes after 30 seconds
4534af6491d293645e32f4000f958590a19c61be 127.0.0.1:21614@21613 myself,master - 0 0 0 connected

01fc6f143b05f1723be7b85a7368f16a3410ac52 127.0.0.1:21611@31611 master,nofailover - 0 1730690295487 1 connected 8192-16383
4534af6491d293645e32f4000f958590a19c61be 127.0.0.1:21614@21613 master - 0 1730690295487 0 connected
156735b32f34055bf448bb1c8ecbd6b626b589da 127.0.0.1:21612@31612 myself,master,nofailover - 0 0 2 connected 0-8191

156735b32f34055bf448bb1c8ecbd6b626b589da 127.0.0.1:21612@31612 master,nofailover - 0 1730690295304 2 connected 0-8191
4534af6491d293645e32f4000f958590a19c61be 127.0.0.1:21614@21613 master - 0 1730690295304 0 connected
01fc6f143b05f1723be7b85a7368f16a3410ac52 127.0.0.1:21611@31611 myself,master,nofailover - 0 0 1 connected 8192-16383

Node 4534af6491d293645e32f4000f958590a19c61be in this case timed out its handshake with the other nodes, which had already put it online. The other two nodes will send it normal ping messages, not MEETs, because they only send the initial MEET message before a node is put online. Node 4534af6491d293645e32f4000f958590a19c61be will ignore those ping messages, since it no longer knows the senders.

I think the larger the cluster, the less likely this is to happen, since as long as one node connects and knows about the rest of the cluster, it will spread gossip, which will retry the connections.
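The stuck state can be illustrated with a toy model. This is a hypothetical Python sketch: the `Node` class, state names, and transitions are simplified stand-ins for illustration, not Valkey's actual implementation.

```python
# Toy model of the handshake asymmetry described above. NOT the actual
# Valkey code; states and transitions are deliberately simplified.
class Node:
    def __init__(self, name):
        self.name = name
        self.known = {}  # peer name -> "handshake" or "online"

    def meet(self, peer_name):
        # CLUSTER MEET: start a handshake with a previously unknown peer.
        self.known[peer_name] = "handshake"

    def receive_meet(self, sender):
        # A MEET is accepted even from an unknown sender.
        self.known[sender] = "handshake"

    def receive_pong(self, sender):
        # A PONG completes an in-progress handshake.
        if self.known.get(sender) == "handshake":
            self.known[sender] = "online"

    def receive_ping(self, sender):
        # A normal PING from an unknown sender is ignored: only peers
        # that are already known get their state refreshed.
        if sender in self.known:
            self.known[sender] = "online"

    def timeout_handshakes(self):
        # A timed-out handshake removes the peer entirely.
        self.known = {p: s for p, s in self.known.items() if s != "handshake"}

old, new = Node("old"), Node("new")
old.meet("new")             # old -> MEET -> new
new.receive_meet("old")     # new puts old in handshake state
old.receive_pong("new")     # new -> PONG -> old: old marks new online
# new's side of the handshake never completes (packets dropped), so:
new.timeout_handshakes()    # new forgets old entirely
new.receive_ping("old")     # old's ordinary PINGs are now ignored
print(old.known)            # {'new': 'online'}
print(new.known)            # {}
```

Because neither side will ever send another MEET, the asymmetric view is stable until an operator intervenes or gossip from a third node reintroduces the peer.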

I'm not sure this is a large issue at the moment, but I wanted to document the finding.

I was also able to get the inverse state, where the new node knows about the meeting nodes but the old nodes do not know the new node.

proc cluster_get_first_node_in_handshake id {
    set nodes [get_cluster_nodes $id]
    foreach n $nodes {
        if {[cluster_has_flag $n handshake]} {
            return [dict get $n id]
        }
    }
    return {}
}

start_cluster 2 0 {tags {external:skip cluster} overrides {cluster-node-timeout 4000 cluster-replica-no-failover yes}} {
    test "Partial node timeout with two shards" {
        set cluster_port [find_available_port $::baseport $::portcount]
        start_server [list overrides [list cluster-enabled yes cluster-node-timeout 4000 cluster-port $cluster_port]] {
            R 1 DEBUG DROP-CLUSTER-PACKET-FILTER -2
            R 2 DEBUG DROP-CLUSTER-PACKET-FILTER -2

            R 1 CLUSTER MEET [srv 0 host] [srv 0 port] $cluster_port
            # Assume this is enough time for node 0 to get the meet; we could make this deterministic, but I'm lazy.
            after 1000
            # Since we are in handshake, the peer has a randomly generated ID that we have to look up
            R 1 DEBUG CLUSTERLINK KILL ALL [cluster_get_first_node_in_handshake 1]

            wait_for_condition 50 100 {
                [cluster_get_first_node_in_handshake 1] eq {}
            } else {
                fail "Node never exited handshake state"
            }

            R 1 DEBUG DROP-CLUSTER-PACKET-FILTER -1
            R 2 DEBUG DROP-CLUSTER-PACKET-FILTER -1


            after 1000
            puts "Nodes after 1 second"
            puts [R 0 CLUSTER NODES]
            puts [R 1 CLUSTER NODES]
            puts [R 2 CLUSTER NODES]

            after 10000
            puts "Nodes after 10 seconds"
            puts [R 0 CLUSTER NODES]
            puts [R 1 CLUSTER NODES]
            puts [R 2 CLUSTER NODES]
        }
    }
}
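For reference, the check that `cluster_get_first_node_in_handshake` performs can be mirrored outside the TCL harness by scanning the flags field of raw `CLUSTER NODES` output. The Python helper below is just an illustration; the field layout (id, address, flags, ...) follows the documented `CLUSTER NODES` format.

```python
# Find the first node id whose flags contain "handshake" in raw
# CLUSTER NODES output. Field layout per the CLUSTER NODES docs:
# <id> <ip:port@cport> <flags> <master> <ping-sent> <pong-recv> ...
def first_node_in_handshake(cluster_nodes_output):
    for line in cluster_nodes_output.strip().splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        node_id, flags = fields[0], fields[2].split(",")
        if "handshake" in flags:
            return node_id
    return None

sample = (
    "abc0 127.0.0.1:21112@31112 myself,master,nofailover - 0 0 0 connected 0-8191\n"
    "abc1 127.0.0.1:21114@21113 handshake - 1731114362912 0 0 connected\n"
)
print(first_node_in_handshake(sample))  # abc1
```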

Note, this requires a special patch for dropping all packets. We have to dodge the ping, pong, and meet packets that the new node we're adding may send out. I'll submit a PR in a second so it's easy to reproduce.

Produces:

Nodes after 1 second
33b594e81bbf2cf81e86935e5b8532bc6ccf2028 127.0.0.1:21111@31111 master,nofailover - 0 1730765784074 1 connected 8192-16383
2805f195fc98926c736d23fa151d007ba538f429 127.0.0.1:21114@21113 myself,master - 0 0 0 connected

33b594e81bbf2cf81e86935e5b8532bc6ccf2028 127.0.0.1:21111@31111 master,fail?,nofailover - 1730765779951 1730765779750 1 connected 8192-16383
67be05cdf2117e32987a871ee290d70ca9fabeae 127.0.0.1:21112@31112 myself,master,nofailover - 0 0 0 connected 0-8191

67be05cdf2117e32987a871ee290d70ca9fabeae 127.0.0.1:21112@31112 master,fail?,nofailover - 1730765780089 1730765779889 0 connected 0-8191
33b594e81bbf2cf81e86935e5b8532bc6ccf2028 127.0.0.1:21111@31111 myself,master,nofailover - 0 0 1 connected 8192-16383

Nodes after 10 seconds
67be05cdf2117e32987a871ee290d70ca9fabeae 127.0.0.1:21112@31112 master,nofailover - 0 1730765794733 0 connected 0-8191
33b594e81bbf2cf81e86935e5b8532bc6ccf2028 127.0.0.1:21111@31111 master,nofailover - 0 1730765793728 1 connected 8192-16383
2805f195fc98926c736d23fa151d007ba538f429 127.0.0.1:21114@21113 myself,master - 0 0 2 connected

33b594e81bbf2cf81e86935e5b8532bc6ccf2028 127.0.0.1:21111@31111 master,fail?,nofailover - 1730765779951 1730765779750 1 connected 8192-16383
67be05cdf2117e32987a871ee290d70ca9fabeae 127.0.0.1:21112@31112 myself,master,nofailover - 0 0 0 connected 0-8191

67be05cdf2117e32987a871ee290d70ca9fabeae 127.0.0.1:21112@31112 master,fail?,nofailover - 1730765780089 1730765779889 0 connected 0-8191
33b594e81bbf2cf81e86935e5b8532bc6ccf2028 127.0.0.1:21111@31111 myself,master,nofailover - 0 0 1 connected 8192-16383

EDIT: The PR: #1252.

Another, simpler way to trigger this inconsistent view of the cluster is to prevent the node being met from receiving the final PONG message of the handshake protocol.
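The four-message exchange and the dropped final PONG can be replayed with a tiny event model (a hypothetical sketch under the same simplifying assumptions as before, not the actual implementation):

```python
# Replay the handshake as a list of delivered messages; the final PONG
# from node 0 is simply never delivered. Illustrative sketch only.
views = {"node0": {}, "node1": {}}

def deliver(msg, src, dst):
    v = views[dst]
    if msg == "MEET":
        v[src] = "handshake"     # MEET is accepted from unknown peers
    elif v.get(src) == "handshake":
        v[src] = "online"        # PING/PONG completes an open handshake
    # PING/PONG from an unknown peer is ignored

views["node0"]["node1"] = "handshake"  # node 0 -- MEET -> node 1
deliver("MEET", "node0", "node1")
deliver("PONG", "node1", "node0")      # node 0 <- PONG -- node 1
deliver("PING", "node1", "node0")      # node 0 <- PING -- node 1
# node 0 -- PONG -> node 1 is dropped, so node 1 times out the handshake:
views["node1"] = {p: s for p, s in views["node1"].items() if s != "handshake"}

print(views["node0"])  # {'node1': 'online'}
print(views["node1"])  # {}
```

The end state matches the logs below: node 0 considers the handshake complete, while node 1 has forgotten node 0 entirely.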

proc cluster_get_first_node_in_handshake id {
    set nodes [get_cluster_nodes $id]
    foreach n $nodes {
        if {[cluster_has_flag $n handshake]} {
            return [dict get $n id]
        }
    }
    return {}
}

start_cluster 2 0 {tags {external:skip cluster} overrides {cluster-node-timeout 4000 cluster-replica-no-failover yes}} {
    test "Partial node timeout with two shards" {
        set cluster_port [find_available_port $::baseport $::portcount]
        start_server [list overrides [list cluster-enabled yes cluster-node-timeout 4000 cluster-port $cluster_port]] {
            # In this test we will trigger a handshake timeout on one side of the handshake.
            # Node 1 and 2 already know each other, then we make node 0 meet node 1:
            #
            # Node 0 -- MEET -> Node 1
            # Node 0 <- PONG -- Node 1
            # Node 0 <- PING -- Node 1 [Node 0 will mark the handshake as successful]
            # Node 0 -- PONG -> Node 1 [we drop this message, so node 1 will eventually mark the handshake as timed out]

            # Drop PONG msgs
            R 1 DEBUG DROP-CLUSTER-PACKET-FILTER 1

            # Node 0 meets node 1
            R 0 CLUSTER MEET [srv -1 host] [srv -1 port]

            # Wait for node 0 to know about the other nodes in the cluster
            wait_for_condition 50 100 {
                [llength [R 0 CLUSTER NODES]] == 26
            } else {
                fail "Node 0 never learned about nodes 1 and 2"
            }
            # At this point, node 0 learned about the other nodes in the cluster from meeting node 1.
            wait_for_condition 50 100 {
                [cluster_get_first_node_in_handshake 0] eq {}
            } else {
                fail "Node 0 never exited handshake state"
            }
            # At this point, from node 0's point of view, the handshake with node 1 succeeded.

            puts "====== cluster nodes"
            puts [R 0 CLUSTER NODES]
            puts [R 1 CLUSTER NODES]
            puts [R 2 CLUSTER NODES]
            puts "======"

            wait_for_condition 50 100 {
                [cluster_get_first_node_in_handshake 1] eq {}
            } else {
                fail "Node 1 never exited handshake state"
            }
            assert {[llength [R 1 CLUSTER NODES]] == 18}
            # At this point, from node 1's point of view, the handshake with node 0 timed out.

            # Allow all msgs
            R 1 DEBUG DROP-CLUSTER-PACKET-FILTER -1

            after 1000
            puts "Nodes after 1 second"
            puts [R 0 CLUSTER NODES]
            puts [R 1 CLUSTER NODES]
            puts [R 2 CLUSTER NODES]

            after 10000
            puts "Nodes after 10 seconds"
            puts [R 0 CLUSTER NODES]
            puts [R 1 CLUSTER NODES]
            puts [R 2 CLUSTER NODES]
        }
    }
}

Results of this test run:

====== cluster nodes
329334e8320192bc62cfdbf245101e3295a12a86 127.0.0.1:21111@31111 master,nofailover - 0 1731114363645 1 connected 8192-16383
54918936be0487170b954f26e7b4810457837742 127.0.0.1:21112@31112 master,nofailover - 0 1731114363544 0 connected 0-8191
1b58b27751a5818d4b43363809922a722ead239f 127.0.0.1:21114@21113 myself,master - 0 0 2 connected

329334e8320192bc62cfdbf245101e3295a12a86 127.0.0.1:21111@31111 master,nofailover - 1731114362812 1731114362711 1 connected 8192-16383
54918936be0487170b954f26e7b4810457837742 127.0.0.1:21112@31112 myself,master,nofailover - 0 0 0 connected 0-8191
7d14f4c0510929f2d8282eefb6445d1f730b73d3 127.0.0.1:21114@21113 handshake - 1731114362912 0 0 connected

54918936be0487170b954f26e7b4810457837742 127.0.0.1:21112@31112 master,nofailover - 0 1731114363592 0 connected 0-8191
329334e8320192bc62cfdbf245101e3295a12a86 127.0.0.1:21111@31111 myself,master,nofailover - 0 0 1 connected 8192-16383

======
Nodes after 1 second
329334e8320192bc62cfdbf245101e3295a12a86 127.0.0.1:21111@31111 master,nofailover - 0 1731114367554 1 connected 8192-16383
54918936be0487170b954f26e7b4810457837742 127.0.0.1:21112@31112 master,nofailover - 0 1731114366953 0 connected 0-8191
1b58b27751a5818d4b43363809922a722ead239f 127.0.0.1:21114@21113 myself,master - 0 0 2 connected

329334e8320192bc62cfdbf245101e3295a12a86 127.0.0.1:21111@31111 master,nofailover - 1731114362812 1731114362711 1 connected 8192-16383
54918936be0487170b954f26e7b4810457837742 127.0.0.1:21112@31112 myself,master,nofailover - 0 0 0 connected 0-8191

54918936be0487170b954f26e7b4810457837742 127.0.0.1:21112@31112 master,nofailover - 0 1731114367804 0 connected 0-8191
329334e8320192bc62cfdbf245101e3295a12a86 127.0.0.1:21111@31111 myself,master,nofailover - 0 0 1 connected 8192-16383

Nodes after 10 seconds
329334e8320192bc62cfdbf245101e3295a12a86 127.0.0.1:21111@31111 master,nofailover - 0 1731114377578 1 connected 8192-16383
54918936be0487170b954f26e7b4810457837742 127.0.0.1:21112@31112 master,nofailover - 0 1731114376575 0 connected 0-8191
1b58b27751a5818d4b43363809922a722ead239f 127.0.0.1:21114@21113 myself,master - 0 0 2 connected

329334e8320192bc62cfdbf245101e3295a12a86 127.0.0.1:21111@31111 master,nofailover - 1731114362812 1731114362711 1 connected 8192-16383
54918936be0487170b954f26e7b4810457837742 127.0.0.1:21112@31112 myself,master,nofailover - 0 0 0 connected 0-8191

54918936be0487170b954f26e7b4810457837742 127.0.0.1:21112@31112 master,nofailover - 0 1731114377935 0 connected 0-8191
329334e8320192bc62cfdbf245101e3295a12a86 127.0.0.1:21111@31111 myself,master,nofailover - 0 0 1 connected 8192-16383