whatyouhide/xandra

Isolated Cassandra node results in `{:cluster, :not_connected}`


I'm running into an issue where I get {:cluster, :not_connected} when one of 3 Cassandra nodes is isolated.

Current setup:
3 Cassandra nodes; one gets isolated (simulated via an iptables DROP on port 7000, Cassandra's inter-node communication port)

iex(1)> {:ok, pid} = Xandra.Cluster.start_link(authentication: {Xandra.Authenticator.Password, username: "<user>", password: "<pw>"}, nodes: ["ip1:9042", "ip2:9042", "ip3:9042"], pool_size: 10)
{:ok, #PID<0.877.0>}

iex(2)> :sys.get_state(pid) 
%Xandra.Cluster{
  autodiscovered_nodes_port: 9042,
  autodiscovery: true,
  load_balancing: :random,
  node_refs: [
    {#Reference<0.2492756426.2133852163.51413>, {ip1}},
    {#Reference<0.2492756426.2133852163.51415>, {ip2}},
    {#Reference<0.2492756426.2133852163.51417>, {ip3}}
  ],
  options: [
    protocol_module: Xandra.Protocol.V3,
    idle_interval: 30000,
    protocol_version: :v3,
    authentication: {Xandra.Authenticator.Password,
     [username: "<user>", password: "<pw>"]},
    pool_size: 10
  ],
  pool_supervisor: #PID<0.878.0>,
  pools: %{
    {ip1} => #PID<0.885.0>,
    {ip2} => #PID<0.909.0>,
    {ip3} => #PID<0.897.0>
  }
}

Now simulate node isolation on one Cassandra node:

$ iptables -I INPUT -p tcp --dport 7000 -j DROP; iptables -I OUTPUT -p tcp --dport 7000 -j DROP;

After a few seconds:

iex(3)> :sys.get_state(pid)
%Xandra.Cluster{
  autodiscovered_nodes_port: 9042,
  autodiscovery: true,
  load_balancing: :random,
  node_refs: [
    {#Reference<0.2492756426.2133852163.51413>, {ip1}},
    {#Reference<0.2492756426.2133852163.51415>, {ip2}},
    {#Reference<0.2492756426.2133852163.51417>, {ip3}}
  ],
  options: [
    protocol_module: Xandra.Protocol.V3,
    idle_interval: 30000,
    protocol_version: :v3,
    authentication: {Xandra.Authenticator.Password,
     [username: "<user>", password: "<pw>"]},
    pool_size: 10
  ],
  pool_supervisor: #PID<0.878.0>,
  pools: %{}
}

Note the Xandra.Cluster state above: pools is now %{}, which results in {:cluster, :not_connected}.
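
For completeness, this is roughly what a query returns in that state. The query is just an example, and the exact action string in the error is from memory and may differ:

iex(4)> Xandra.Cluster.execute(pid, "SELECT release_version FROM system.local")
{:error,
 %Xandra.ConnectionError{
   action: "checkout from cluster",
   reason: {:cluster, :not_connected}
 }}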

I also tried 0.14.0, but apart from this #262 issue, it's still broken.

I also tried the current master, and there it works.

When do you plan a new release?

I debugged this a bit deeper and found more details:

Let's say we have 3 Cassandra nodes.

The driver (Xandra.Cluster) opens a control connection to every node.

Now we 'isolate' node1 via iptables --dport 7000 -j DROP.

All 3 control connections are still working, because we are only blocking port 7000, and they continue reporting cluster events.

node2 and node3 report StatusChanged{reason: "DOWN", node: "node1"}
Which is correct from their point of view.

BUT!:

node1 reports StatusChanged{reason: "DOWN", node: "node2"} and StatusChanged{reason: "DOWN", node: "node3"}

Which is also correct, from its point of view.

So the driver thinks all nodes are down, which now results in {:cluster, :not_connected}.
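
To make the failure mode concrete, here is a minimal sketch in plain Elixir (not Xandra's actual internals; node names and event shapes are placeholders) of what happens when the DOWN reports from all three control connections are applied to one shared pool map:

# Hypothetical reduction over the observed events: every DOWN report,
# no matter which control connection sent it, removes that peer's pool.
events = [
  {:node2, "DOWN", :node1},
  {:node3, "DOWN", :node1},
  {:node1, "DOWN", :node2},
  {:node1, "DOWN", :node3}
]

pools = %{node1: :pool1, node2: :pool2, node3: :pool3}

Enum.reduce(events, pools, fn {_reporter, "DOWN", node}, acc ->
  Map.delete(acc, node)
end)
#=> %{}  (every node is reported DOWN by someone, leaving no pools,
#    exactly the state shown in the :sys.get_state/1 output above)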

@franke1276 thanks for the report! Does this happen on the main branch too?

Yes, it also happens there.

@franke1276 I've made some changes on the main branch. Now we only open a single control connection, to one of the nodes in the cluster. In your case, for example, if we open that connection to node1, we'll indeed see node2 and node3 as down. In my opinion, that's probably the correct behavior: we're trusting what the cluster says.
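
For contrast, here is the same hypothetical reduction with a single control connection (say it was established to node1):

events_from_node1 = [{"DOWN", :node2}, {"DOWN", :node3}]

pools = %{node1: :pool1, node2: :pool2, node3: :pool3}

Enum.reduce(events_from_node1, pools, fn {"DOWN", node}, acc ->
  Map.delete(acc, node)
end)
#=> %{node1: :pool1}  (the driver keeps routing queries through node1
#    instead of failing with {:cluster, :not_connected})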

Could you give the new main a try and see what happens?

I’m closing this one since I don't believe it's valid anymore after the big round of changes that happened in the last couple of weeks. If this is still an issue, we can open a new issue and look into it!

Thanks for the original report @franke1276 💟