whatyouhide/xandra

Connection storms when recycling nodes in the cluster

Closed this issue · 4 comments

yangou commented

We had to recycle some Scylla nodes for a maintenance task. What we observed is that when a node goes down and the topology changes, Scylla receives a connection storm from the clients. We use the configuration below to connect to Scylla:

  opts = [
    nodes: ["scylladb1:9042", "scylladb2:9042", "scylladb3:9042"],
    retry_strategy: DefaultRetryStrategy,
    pool_size: 25
  ]

  Supervisor.child_spec({Xandra.Cluster, opts}, [])

and our DefaultRetryStrategy is defined below:

defmodule DefaultRetryStrategy do
  @behaviour Xandra.RetryStrategy

  @retry_count 5

  # The strategy's state is simply the number of retries left.
  @impl true
  def new(options) do
    Keyword.get(options, :retry_count, @retry_count)
  end

  # Give up once the counter reaches zero.
  @impl true
  def retry(_error, _options, _retries_left = 0), do: :error

  # Otherwise retry with the same options and one fewer retry left.
  @impl true
  def retry(_error, options, retries_left) do
    {:retry, options, retries_left - 1}
  end
end

Could it be that the auto-discovery is causing the control connection to probe the cluster indefinitely?
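One way we could test this (assuming the Xandra version we run exposes the :autodiscovery option on Xandra.Cluster, which we'd have to check in the docs for our version) would be to turn discovery off and see whether the probing stops:

opts = [
  nodes: ["scylladb1:9042", "scylladb2:9042", "scylladb3:9042"],
  retry_strategy: DefaultRetryStrategy,
  pool_size: 25,
  # Assumed option; only present in Xandra versions that support it.
  autodiscovery: false
]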

On the same setup as above, we were seeing up to 70K client connections per node on a 32-core node. This kills the node and it no longer serves clients. The situation can only be recovered by stopping the clients; once they are stopped, the node recovers in minutes, and then we have to bring the clients back slowly.

yangou commented

We spent more time and located the real issue.

When a node is offline for more than a certain number of minutes (for maintenance, etc.), the DBConnection to that node will exit. The pool supervisor in Xandra.Cluster is configured with the default restart intensity, so once the allowed number of restarts within the 5-second window is exceeded, the pool supervisor terminates all of its children and exits with :shutdown as the reason.

This cascades all the way back up to the application, because we are using the default supervisor intensity options everywhere. Eventually the application terminates, and our AWS auto-scaling kicks in and keeps restarting it.
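To illustrate the cascade (a minimal sketch of a typical application supervision tree, not our exact code; MyApp is a placeholder, and the intensity values are just the Supervisor defaults made explicit):

defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    opts = [
      nodes: ["scylladb1:9042", "scylladb2:9042", "scylladb3:9042"],
      retry_strategy: DefaultRetryStrategy,
      pool_size: 25
    ]

    children = [
      {Xandra.Cluster, opts}
    ]

    # Defaults shown explicitly: once the Xandra.Cluster child exceeds
    # 3 restarts within 5 seconds, this supervisor exits with :shutdown,
    # which in turn takes the whole application down.
    Supervisor.start_link(children,
      strategy: :one_for_one,
      max_restarts: 3,
      max_seconds: 5,
      name: MyApp.Supervisor
    )
  end
end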

We use ScyllaDB, and ScyllaDB uses the system.clients table to keep track of connections. This table persists client connection records and relies on compaction to clean up stale ones (not a great way to track transient data), so the large number of connections being created and destroyed eventually kills the node.
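For reference, this is roughly how we looked at what a node is tracking (a quick sketch; the single plain Xandra connection here is just for inspection):

{:ok, conn} = Xandra.start_link(nodes: ["scylladb1:9042"])

# Ask one node how many client connections it currently has on record.
page = Xandra.execute!(conn, "SELECT COUNT(*) FROM system.clients")
IO.inspect(Enum.to_list(page), label: "rows in system.clients")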

Proposed fix:

Introduce a GenServer called Xandra.Node between Xandra.Cluster's pool supervisor and Xandra's DBConnection. It would trap the DBConnection exits, use DBConnection.Backoff to manage reconnecting to the node, and be able to mark the node as unavailable for load balancing while it is down.
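A rough sketch of the idea (illustrative only; Xandra.Node is just the name we're proposing, and the simple exponential backoff below is a stand-in for DBConnection.Backoff):

defmodule Xandra.Node do
  @moduledoc false
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    # Trap exits so a crashing DBConnection pool sends us a message
    # instead of taking this process (and its supervisor) down.
    Process.flag(:trap_exit, true)
    {:ok, %{opts: opts, conn: nil, backoff: 1_000}, {:continue, :connect}}
  end

  @impl true
  def handle_continue(:connect, state) do
    case Xandra.start_link(state.opts) do
      {:ok, conn} ->
        # Here we would also tell the cluster that this node is back up.
        {:noreply, %{state | conn: conn, backoff: 1_000}}

      {:error, _reason} ->
        Process.send_after(self(), :reconnect, state.backoff)
        {:noreply, %{state | backoff: min(state.backoff * 2, 30_000)}}
    end
  end

  @impl true
  def handle_info({:EXIT, conn, _reason}, %{conn: conn} = state) do
    # The DBConnection pool for this node died: mark the node as
    # unavailable in the load balancing and schedule a reconnect.
    Process.send_after(self(), :reconnect, state.backoff)
    {:noreply, %{state | conn: nil, backoff: min(state.backoff * 2, 30_000)}}
  end

  def handle_info(:reconnect, state) do
    {:noreply, state, {:continue, :connect}}
  end

  def handle_info(_other, state), do: {:noreply, state}
end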

@yangou thanks for the report!

I’m confused by DBConnection exiting after a certain number of minutes of being disconnected. Are you passing any options to DBConnection (in Xandra.start_link/1) that make that happen, like backoff_type: :stop? Otherwise, it should keep retrying forever. 🤔
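For reference, the only kind of configuration I'd expect to produce that behavior is something like this (a hypothetical example; the default is backoff_type: :rand_exp, which keeps retrying forever):

Xandra.start_link(
  nodes: ["scylladb1:9042"],
  # Makes the connection stop instead of backing off and reconnecting.
  backoff_type: :stop
)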

@yangou I’m closing this for inactivity. If you manage to reproduce this and post more info about DBConnection exiting, feel free to comment here and we'll take another look at this! Thanks for the report 💟