Shutting down Cassandra node causes process exit

Question

harunzengin opened this issue 6 months ago · 2 comments

While testing #358 and shutting nodes down, I noticed that we get exits like the following when a node is shut down:

 ** (stop) exited in: :gen_statem.call(#PID<0.5947.0>, {:checkout_state_for_next_request, #Reference<0.0.460547.2884031648.17891330.200184>}, :infinity)
     ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
     (stdlib 5.2) gen.erl:246: :gen.do_call/4
     (stdlib 5.2) gen_statem.erl:923: :gen_statem.call/3
     (xandra 0.18.1) lib/xandra/connection.ex:158: Xandra.Connection.execute/4
     (xandra 0.18.1) lib/xandra.ex:1272: Xandra.execute_without_retrying/4
     (xandra 0.18.1) lib/xandra/retry_strategy.ex:309: Xandra.RetryStrategy.run_on_cluster/5

I guess the connection processes get terminated right after Xandra.Cluster.Pool.checkout returns the connection pids, so the call into the dead connection exits and takes the client process down with it. I think this also means the RetryStrategy never gets the chance to try the query on another node.

Answer 1 · 2024-03-13T13:10:54.000Z

Ah, gotcha, yes this makes sense. @harunzengin I think the solution here is to guard against exits when calling Xandra.Connection.execute/4. The thing I’m trying to figure out is where to guard against this. We could do it in Xandra.Connection.execute/4 itself, but that worries me because it applies to non-cluster connections too (which should not go down).

An alternative is to do it in places like this, where instead of calling Xandra.Connection.execute/4 directly we wrap the call. Something like:

    with_conn_and_retrying(cluster, options, fn conn ->
      try do
        Xandra.execute(conn, query, params, options_without_retry_strategy)
      catch
        # IIRC this is what it looks like but this needs to be tested.
        :exit, {:noproc, _} ->
          {:error, ...}
      end
    end)
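
To check the exit shape outside of Xandra, here is a minimal standalone sketch of the same guard. It assumes the exit reason is a `{:noproc, _}` tuple, which is what a `GenServer.call/3` to a dead pid produces. `ExitGuardSketch`, `guard_noproc/1`, and the `:connection_down` reason are hypothetical names for illustration, not part of Xandra, and an `Agent` stands in for the connection process.

```elixir
defmodule ExitGuardSketch do
  # Hypothetical helper, not part of Xandra: runs `fun` and turns a
  # :noproc exit (the callee is already dead) into an error tuple.
  def guard_noproc(fun) when is_function(fun, 0) do
    try do
      fun.()
    catch
      # The second element carries the {module, :call, args} of the failed call.
      :exit, {:noproc, _call_info} -> {:error, :connection_down}
    end
  end
end

# Stand-in for a connection process that dies right after checkout.
# Agent.start/1 (not start_link) so stopping it cannot affect the caller.
{:ok, pid} = Agent.start(fn -> :state end)
:ok = Agent.stop(pid)

# Without the guard this call would exit the calling process.
result = ExitGuardSketch.guard_noproc(fn -> Agent.get(pid, & &1) end)
IO.inspect(result)
```

This only covers the case where the process is already gone before the call. A connection that dies while a call is in flight would exit with a different reason, so that case would need its own clause and its own test.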

Thoughts? Can you work on a PR? I won't have time this week.

Answer 2 · 2024-03-28T08:29:13.000Z

@harunzengin ping 🙃