
Xandra.Cluster does not detect disconnection


jvf commented

Say we use Xandra.Cluster connected to multiple Cassandra nodes, and one of the connections breaks. The network problem is only between the client host and one Cassandra node, meaning the Cassandra cluster itself is still healthy and all nodes can see each other. In this scenario, Xandra will keep routing queries to the dysfunctional connection. All these queries will fail (and keep failing).

Demonstration of the problem

Assume a multi-node Cassandra cluster. First, we establish a Xandra.Cluster connection to two of the Cassandra nodes:

jvf@client-host:~/ > iex
iex(1)> Mix.install([{:xandra, "0.13.1"}])
iex(2)> Logger.put_application_level(:xandra, :debug)
iex(3)> auth = {Xandra.Authenticator.Password, username: "<*****>", password: "<*****>"}
iex(4)> nodes = ["172.28.32.38:9042", "172.28.32.132:9042"]
iex(5)> {:ok, conn} = Xandra.Cluster.start_link(nodes: nodes, authentication: auth, autodiscovery: false, load_balancing: :priority)
iex(6)>
09:49:03.726 [debug] Control connection for 172.28.32.132:9042 is up

09:49:03.756 [debug] Started connection to {172, 28, 32, 132}

09:49:03.756 [debug] Control connection for 172.28.32.38:9042 is up

09:49:03.757 [debug] Started connection to {172, 28, 32, 38}

iex(7)> IO.puts(DateTime.utc_now); Xandra.Cluster.execute!(conn, "SELECT rpc_address FROM system.local") |> Enum.take(1)
2023-03-03 09:49:09.642333Z
[%{"rpc_address" => {172, 28, 32, 38}}]

As you can see, connections to both nodes are established. Since we are using load_balancing: :priority, all queries are routed to the first listed node.

We use load_balancing: :priority here because it makes the problem easier to demonstrate, but the same issue occurs with load_balancing: :random; you just have to retry Xandra.Cluster.execute!(conn, "SELECT rpc_address FROM system.local") a few times at each step to hit the relevant node.
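For illustration, a small loop like the following (a quick sketch reusing the conn and query from above) shows how often each node answers when load_balancing: :random is used:

results =
  for _ <- 1..10 do
    conn
    |> Xandra.Cluster.execute!("SELECT rpc_address FROM system.local")
    |> Enum.take(1)
    |> hd()
    |> Map.fetch!("rpc_address")
  end

# Count how often each rpc_address answered.
Enum.frequencies(results)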

Now, in another terminal on the first Cassandra node, stop the node:

jvf@172.28.32.38:~/ > sudo systemctl stop cassandra ; date -u +%FT%TZ
2023-03-03T09:49:27Z

In the client session you can see that the node is detected as DOWN. We can still query the cluster; the second configured node is used:

09:49:22.028 [debug] Received event: %Xandra.Cluster.StatusChange{effect: "DOWN", address: {172, 28, 32, 38}, port: 9042}

09:49:22.041 [debug] Received event: %Xandra.Cluster.StatusChange{effect: "DOWN", address: {172, 28, 32, 38}, port: 9042}
iex(8)> IO.puts(DateTime.utc_now); Xandra.Cluster.execute!(conn, "SELECT rpc_address FROM system.local") |> Enum.take(1)
2023-03-03 09:49:35.415527Z
[%{"rpc_address" => {172, 28, 32, 132}}]

Now, again in another terminal on the first Cassandra node, restart the node:

jvf@172.28.32.38:~/ > sudo systemctl restart cassandra ; date -u +%FT%TZ
2023-03-03T09:49:45Z

In the client session you can see that the node is detected as UP, and once we query again we get the original node:

09:50:05.164 [debug] Received event: %Xandra.Cluster.StatusChange{effect: "UP", address: {172, 28, 32, 38}, port: 9042}
iex(9)> IO.puts(DateTime.utc_now); Xandra.Cluster.execute!(conn, "SELECT rpc_address FROM system.local") |> Enum.take(1)
2023-03-03 09:50:14.878490Z
[%{"rpc_address" => {172, 28, 32, 38}}]

Now, to simulate a broken network connection, we go to another terminal on the client host and use iptables to block the traffic to our first Cassandra node:

root@client-host:~# mode=I; iptables -"$mode" OUTPUT -p tcp --dport 9042 -j DROP -d 172.28.32.38 && date -u +%FT%TZ
2023-03-03T09:50:24Z

In the client session nothing happens: the node is not detected as DOWN because, from Cassandra's point of view, the cluster is perfectly healthy. But queries routed to this node obviously cannot work anymore:

iex(10)> IO.puts(DateTime.utc_now); Xandra.Cluster.execute!(conn, "SELECT rpc_address FROM system.local") |> Enum.take(1)
2023-03-03 09:50:29.325737Z

09:50:44.331 [error] Xandra.Connection (#PID<0.219.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.108.0> timed out because it queued and checked out the connection for longer than 15000ms

#PID<0.108.0> was at location:

    :prim_inet.recv0/3
    (xandra 0.13.1) lib/xandra/connection/utils.ex:12: Xandra.Connection.Utils.recv_frame/4
    (xandra 0.13.1) lib/xandra/connection.ex:193: Xandra.Connection.handle_execute/4
    (db_connection 2.4.3) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
    (db_connection 2.4.3) lib/db_connection.ex:1413: DBConnection.run_execute/5
    (db_connection 2.4.3) lib/db_connection.ex:1508: DBConnection.run/6
    (db_connection 2.4.3) lib/db_connection.ex:644: DBConnection.parsed_prepare_execute/5
    (db_connection 2.4.3) lib/db_connection.ex:636: DBConnection.prepare_execute/4

** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
    (xandra 0.13.1) lib/xandra/cluster.ex:344: Xandra.Cluster.execute!/3
    iex:10: (file)
iex(10)>
09:50:49.333 [error] Xandra.Connection (#PID<0.219.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: :timeout

09:50:56.868 [error] Xandra.Connection (#PID<0.219.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: :timeout

09:51:06.699 [error] Xandra.Connection (#PID<0.219.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: :timeout

09:51:25.497 [error] Xandra.Connection (#PID<0.219.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: :timeout

iex(10)> IO.puts(DateTime.utc_now); Xandra.Cluster.execute!(conn, "SELECT rpc_address FROM system.local") |> Enum.take(1)
2023-03-03 09:51:48.975583Z
** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 2189ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:

  1. Ensuring your database is available and that you can connect to it
  2. Tracking down slow queries and making sure they are running fast enough
  3. Increasing the pool_size (although this increases resource consumption)
  4. Allowing requests to wait longer by increasing :queue_target and :queue_interval

See DBConnection.start_link/2 for more information

    (xandra 0.13.1) lib/xandra/cluster.ex:344: Xandra.Cluster.execute!/3
    iex:10: (file)

In another terminal on the client host we restore the network connection:

root@battery-proxy-faux:/home/jvf# mode=D; iptables -"$mode" OUTPUT -p tcp --dport 9042 -j DROP -d 172.28.32.38 && date -u +%FT%TZ
2023-03-03T09:52:03Z

The connection re-establishes and is usable again:

iex(10)> IO.puts(DateTime.utc_now); Xandra.Cluster.execute!(conn, "SELECT rpc_address FROM system.local") |> Enum.take(1)
2023-03-03 09:52:10.463128Z
[%{"rpc_address" => {172, 28, 32, 38}}]

Explanation

My understanding is this: Xandra subscribes to STATUS_CHANGE events from Cassandra and acts on these events via the control connections. If a control connection receives a DOWN event, the connection pool for that node is terminated.

Xandra doesn't have any other mechanism to determine whether a connection is healthy or not. But a connection can be broken, for example due to network issues between the client host and one specific Cassandra node. In that case, as shown above, Xandra will continue to route queries to the dysfunctional connection pool and all of these queries will fail.

Desired behaviour

Xandra should detect the faulty connection and route all queries to the working connections. If the faulty connection heals, Xandra should eventually start using it again.

Proposal

I think that to solve this, Xandra needs to be able to detect a failing connection itself and use this information, in addition to the STATUS_CHANGE events, to determine the health of a connection.

My idea for failure detection is to use connection listeners from DBConnection (plus Process.monitor/1). If all connection processes of a node's pool are down, we consider that node disconnected and terminate the pool. One could also do this with a circuit breaker pattern. A rough sketch follows below.
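To make the idea more concrete, here is a rough sketch of such a listener process. It assumes the pool forwards DBConnection's :connection_listeners option; the module name and the notify_fun callback are hypothetical and not part of Xandra or DBConnection. DBConnection sends {:connected, pid} and {:disconnected, pid} messages to each configured listener, and Process.monitor/1 covers connection processes that die without a clean disconnect:

defmodule PoolHealthWatcher do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    # notify_fun is a hypothetical callback invoked with :up or :down for this node's pool.
    {:ok, %{connected: MapSet.new(), notify: Keyword.fetch!(opts, :notify_fun)}}
  end

  @impl true
  def handle_info({:connected, conn_pid}, state) do
    # Sent by DBConnection to every process listed under :connection_listeners.
    Process.monitor(conn_pid)
    {:noreply, put_connected(state, conn_pid)}
  end

  def handle_info({:disconnected, conn_pid}, state) do
    {:noreply, drop_connected(state, conn_pid)}
  end

  def handle_info({:DOWN, _ref, :process, conn_pid, _reason}, state) do
    # Covers connection processes that crash without sending a :disconnected message.
    {:noreply, drop_connected(state, conn_pid)}
  end

  defp put_connected(state, pid) do
    if MapSet.size(state.connected) == 0, do: state.notify.(:up)
    %{state | connected: MapSet.put(state.connected, pid)}
  end

  defp drop_connected(state, pid) do
    connected = MapSet.delete(state.connected, pid)
    # All connection processes are gone: treat this node's pool as disconnected.
    if MapSet.size(connected) == 0 and MapSet.size(state.connected) > 0, do: state.notify.(:down)
    %{state | connected: connected}
  end
end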

This also requires a mechanism for reconnecting the connection once it heals, since we will not receive an UP event in this scenario. Probably something like periodic reconnection attempts by the control connection, or similar.
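As a rough illustration of such periodic reconnection attempts (the module name and the on_reachable callback are hypothetical, not part of Xandra), a process could keep probing a node that was marked down and report back once it becomes reachable again:

defmodule NodeReconnector do
  use GenServer

  @retry_interval :timer.seconds(10)

  def start_link({host, port, on_reachable}),
    do: GenServer.start_link(__MODULE__, {host, port, on_reachable})

  @impl true
  def init({host, port, on_reachable}) do
    schedule_probe()
    {:ok, %{host: host, port: port, on_reachable: on_reachable}}
  end

  @impl true
  def handle_info(:probe, state) do
    # A plain TCP connect is enough to tell whether the node is reachable again;
    # the real implementation could instead let the control connection retry.
    case :gen_tcp.connect(String.to_charlist(state.host), state.port, [:binary], 5_000) do
      {:ok, socket} ->
        :gen_tcp.close(socket)
        state.on_reachable.()
        {:stop, :normal, state}

      {:error, _reason} ->
        schedule_probe()
        {:noreply, state}
    end
  end

  defp schedule_probe, do: Process.send_after(self(), :probe, @retry_interval)
end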