Supervisor of `Xandra.Cluster` receives unexpected messages

Question

Supervisor of `Xandra.Cluster` receives unexpected messages

Closed this issue a year ago · 8 comments

When starting Xandra.Cluster with a supervisor, the supervisor receives bunch of unexpected messages:

Starting Cassandra connection pool with: [name: :"CassandraConnection", keyspace: :test, username: "cassandra", password: "REDACTED", nodes: ["localhost"], pool_size: 10]

2023-08-16 14:00:10.239 module=supervisor pid=<0.629.0> [error]: Supervisor received unexpected message: {#Reference<0.390389104.540540930.64231>, {:ok, #PID<0.635.0>}}

2023-08-16 14:00:10.250 module=supervisor pid=<0.629.0> [error]: Supervisor received unexpected message: {#Reference<0.390389104.540540930.64231>, {:ok, #PID<0.635.0>}}

2023-08-16 14:00:10.250 module=supervisor pid=<0.629.0> [error]: Supervisor received unexpected message: {#Reference<0.390389104.540540930.64231>, {:ok, #PID<0.635.0>}}

2023-08-16 14:00:10.250 module=supervisor pid=<0.629.0> [error]: Supervisor received unexpected message: {#Reference<0.390389104.540540930.64231>, {:ok, #PID<0.635.0>}}

2023-08-16 14:00:10.250 module=supervisor pid=<0.629.0> [error]: Supervisor received unexpected message: {#Reference<0.390389104.540540930.64231>, {:ok, #PID<0.635.0>}}

2023-08-16 14:00:10.250 module=supervisor pid=<0.629.0> [error]: Supervisor received unexpected message: {#Reference<0.390389104.540540930.64231>, {:ok, #PID<0.635.0>}}

2023-08-16 14:00:10.250 module=supervisor pid=<0.629.0> [error]: Supervisor received unexpected message: {#Reference<0.390389104.540540930.64231>, {:ok, #PID<0.635.0>}}

2023-08-16 14:00:10.250 module=supervisor pid=<0.629.0> [error]: Supervisor received unexpected message: {#Reference<0.390389104.540540930.64231>, {:ok, #PID<0.635.0>}}

2023-08-16 14:00:10.250 module=supervisor pid=<0.629.0> [error]: Supervisor received unexpected message: {#Reference<0.390389104.540540930.64231>, {:ok, #PID<0.635.0>}}

I checked the process info of {:ok, #PID<0.635.0>}, and it looks like it is the DBConnection.ConnectionPool:

Process.info(pid(0, 635, 0))
[
  current_function: {:gen_server, :loop, 7},
  initial_call: {:proc_lib, :init_p, 5},
  status: :waiting,
  message_queue_len: 0,
  links: [#PID<0.631.0>, #PID<0.611.0>],
  dictionary: [
    "$ancestors": [#PID<0.631.0>,
     ...,
      #PID<0.622.0>],
    "$initial_call": {DBConnection.ConnectionPool, :init, 1},
    connection_module: Xandra.Connection
  ],
  trap_exit: false,
  error_handler: :error_handler,
  priority: :normal,
  group_leader: #PID<0.621.0>,
  total_heap_size: 2828,
  heap_size: 1598,
  stack_size: 11,
  reductions: 23743,
  garbage_collection: [
    max_heap_size: %{error_logger: true, kill: true, size: 0},
    min_bin_vheap_size: 46422,
    min_heap_size: 233,
    fullsweep_after: 65535,
    minor_gcs: 56
  ],
  suspending: []
]

state while starting Xandra.Pool:

%Xandra.Cluster.Pool{
  pool_options: [
    protocol_version: :v5,
    authentication: {Xandra.Authenticator.Password,
     [username: "cassandra", password: "cassandra"]},
    after_connect: #Function<0.98838439/1 in Conn.maybe_set_keyspace/1>,
    attempt_connection: true,
    default_consistency: :local_one,
    atom_keys: true,
    pool_size: 10
  ],
  contact_nodes: [{'localhost', 9042}],
  autodiscovered_nodes_port: 9042,
  pool_supervisor: #PID<0.631.0>,
  control_connection: nil,
  refresh_topology_interval: 300000,
  load_balancing_module: Xandra.Cluster.LoadBalancingPolicy.DCAwareRoundRobin,
  load_balancing_state: %Xandra.Cluster.LoadBalancingPolicy.DCAwareRoundRobin{
    local_dc: nil,
    local_hosts: [],
    remote_hosts: []
  },
  target_pools: 1,
  sync_connect_ref: nil,
  name: {:local, :"BatterySignalsWriter.CassandraConnection"},
  peers: %{},
  reqs_before_connecting: %{max_size: 100, queue: {[], []}},
  xandra_mod: Xandra,
  control_conn_mod: Xandra.Cluster.ControlConnection
}

Answer 1 · 2023-08-16T14:39:45.000Z

@harunzengin is this in main?

Answer 2 · 2023-08-16T14:43:18.000Z

@whatyouhide yes, 20b62d1

Answer 3 · 2023-08-17T10:11:10.000Z

@harunzengin I can't seem to reproduce this. Do you have steps for this? Also, which supervisor is pid=<0.629.0>?

Answer 4 · 2023-08-17T16:25:02.000Z

@whatyouhide I created a new mix project to have a minimal reproducible example, however, I also wasn't able to reproduce it there.

However, now trying with 96303f0 , I couldn't even get our application tree started. (It did start when I wiped deps and _build after a long compilation, any try after that didn't work):

            ** (MatchError) no match of right hand side value: {:error, {{:EXIT, {:noproc, {:gen_statem, :call, [#PID<0.1690.0>, :checkout, :infinity]}}}, {:child, :undefined, Conn, {Conn, :start_link, [[name: :"CassandraConnection", password: "cassandra", username: "cassandra", keyspace: :test, nodes: ["localhost"], pool_size: 5]]}, :permanent, false, 5000, :worker, [Conn]}}}

I think there's a race condition there where something isn't ready but we try to call gen_statem.

Answer 5 · 2023-08-17T17:31:59.000Z

@harunzengin ok, yes, that makes sense. You are registering the cluster under a name, and maybe the process isn't ready when you call it through the name. Any chance you're using :sync_connect? That would explain this, for example.

Answer 6 · 2023-08-17T17:38:31.000Z

@whatyouhide No, there's no :sync_connect.

I also tried with removing the given name to the process, didn't work. Something inside Xandra has a race condition, which I'm trying to figure out.

PS: Nothing in our process tree changed, only the Xandra version.

Answer 7 · 2023-08-17T17:51:12.000Z

Immediately after starting the Xandra.Cluster, we call an attempt_connection function that looks like this:

  def attempt_connection(conn, options, tries \\ 10) do
    Process.sleep(50)
    Xandra.Cluster.execute!(conn, "SELECT key FROM system.local LIMIT 1", [], queue: false)

this causes the exit above, since we somehow get the connection pid, but it doesn't work when we try to query it (after 50ms).

Answer 8 · 2023-08-17T18:07:48.000Z

Okay, :sync_connect explains this actually, sorry about the confusion. I guess we need to attempt the connection after the process tree has started now on our side. And I wasn't able to reproduce the unexpected messages in the supervisor as well. So I'm closing this.