EXITs not handled when Cassandra node goes down
Closed this issue · 5 comments
Currently, when a Cassandra node goes down, we get some errors at first, among them some process EXITs. We see these errors under some load (~200 queries/second) on a 3-node cluster with a pool_size of 10: while processes are in the middle of executing queries, they fail with EXITs which aren't caught.
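For context, a minimal sketch of how a worker could shield itself from this (assumed names; only Xandra.execute/4 is the real API here). Since the failure arrives as a process exit rather than a raised exception, a rescue around the call doesn't help; only catch :exit does:

```elixir
defmodule Worker.SafeExecute do
  # Hypothetical wrapper: converts the exit escaping from the pool checkout
  # into a normal error tuple, so callers (and our retry strategy) can
  # handle it like any other %Xandra.ConnectionError{}-style failure.
  def execute(conn, statement, params, opts \\ []) do
    try do
      Xandra.execute(conn, statement, params, opts)
    catch
      :exit, reason -> {:error, {:pool_exit, reason}}
    end
  end
end
```

This is a workaround sketch on the caller side, not a fix inside Xandra itself.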
Stacktrace
2023-04-19 09:17:48.731 module=Worker.CounterRetryStrategy pid=<0.2226.0> [warning]: Retrying. retries_left: 3 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.732 module=Worker.CounterRetryStrategy pid=<0.2226.0> [warning]: Retrying. retries_left: 2 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.732 module=Worker.CounterRetryStrategy pid=<0.2226.0> [warning]: Retrying. retries_left: 1 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.732 module=DBConnection.Connection pid=<0.2162.0> [error]: Xandra.Connection (#PID<0.2162.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.732 module=DBConnection.Connection pid=<0.2160.0> [error]: Xandra.Connection (#PID<0.2160.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.733 module=Worker.CounterRetryStrategy pid=<0.2226.0> [warning]: stopped retrying, error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.733 module=Worker.CounterRetryStrategy pid=<0.2226.0> [warning]: Retrying. retries_left: 3 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.733 module=DBConnection.Connection pid=<0.2163.0> [error]: Xandra.Connection (#PID<0.2163.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.734 module=DBConnection.Connection pid=<0.2158.0> [error]: Xandra.Connection (#PID<0.2158.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.735 module=DBConnection.Connection pid=<0.2163.0> [error]: Xandra.Connection (#PID<0.2163.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.735 module=DBConnection.Connection pid=<0.2160.0> [error]: Xandra.Connection (#PID<0.2160.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.735 module=DBConnection.Connection pid=<0.2162.0> [error]: Xandra.Connection (#PID<0.2162.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.736 module=DBConnection.Connection pid=<0.2158.0> [error]: Xandra.Connection (#PID<0.2158.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.765 module=Worker.CounterRetryStrategy pid=<0.2232.0> [warning]: Retrying. retries_left: 3 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.765 module=DBConnection.Connection pid=<0.2164.0> [error]: Xandra.Connection (#PID<0.2164.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.766 module=Worker.CounterRetryStrategy pid=<0.2232.0> [warning]: Retrying. retries_left: 2 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.766 module=DBConnection.Connection pid=<0.2164.0> [error]: Xandra.Connection (#PID<0.2164.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.766 module=DBConnection.Connection pid=<0.2167.0> [error]: Xandra.Connection (#PID<0.2167.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.767 module=Worker.CounterRetryStrategy pid=<0.2232.0> [warning]: Retrying. retries_left: 1 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.767 module=DBConnection.Connection pid=<0.2165.0> [error]: Xandra.Connection (#PID<0.2165.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.767 module=Worker.CounterRetryStrategy pid=<0.2232.0> [warning]: stopped retrying, error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.767 module=Worker.CounterRetryStrategy pid=<0.2232.0> [warning]: Retrying. retries_left: 3 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.767 module=DBConnection.Connection pid=<0.2167.0> [error]: Xandra.Connection (#PID<0.2167.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.767 module=DBConnection.Connection pid=<0.2166.0> [error]: Xandra.Connection (#PID<0.2166.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.768 module=DBConnection.Connection pid=<0.2165.0> [error]: Xandra.Connection (#PID<0.2165.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.768 module=DBConnection.Connection pid=<0.2166.0> [error]: Xandra.Connection (#PID<0.2166.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.872 module=Worker.CounterRetryStrategy pid=<0.2238.0> [warning]: Retrying. retries_left: 3 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.872 module=DBConnection.Connection pid=<0.2161.0> [error]: Xandra.Connection (#PID<0.2161.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.872 module=Worker.CounterRetryStrategy pid=<0.2238.0> [warning]: Retrying. retries_left: 2 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.872 module=DBConnection.Connection pid=<0.2159.0> [error]: Xandra.Connection (#PID<0.2159.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.873 module=supervisor pid=<0.2236.0> [error]: Child Worker of Supervisor #PID<0.2236.0> (Supervisor.Server) terminated
** (exit) exited in: DBConnection.Holder.checkout(#PID<0.2156.0>, [telemetry_metadata: %{}, decimal_format: :tuple, timestamp_format: :datetime, time_format: :time, date_format: :date, tracing: false, page_size: 10000, uuid_format: :binary, timeuuid_format: :binary, retry_strategy: Worker.CounterRetryStrategy, retry_count: 3])
** (EXIT) shutdown
Pid: #PID<0.2238.0>
Start Call: Worker.start_link([db_conn: #PID<0.2135.0>, serial: 100093, cluster_map: [...]])
2023-04-19 09:17:48.873 module=DBConnection.Connection pid=<0.2159.0> [error]: Xandra.Connection (#PID<0.2159.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.873 module=DBConnection.Connection pid=<0.2161.0> [error]: Xandra.Connection (#PID<0.2161.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
The process exits recur after each periodic topology refresh while the node is still down, which is annoying and causes log spam.
Stacktrace
2023-04-19 09:21:13.716 module=supervisor pid=<0.2488.0> [error]: Child Worker of Supervisor #PID<0.2488.0> (Supervisor.Server) terminated
** (exit) exited in: DBConnection.Holder.checkout(#PID<0.2913.0>, [telemetry_metadata: %{}, decimal_format: :tuple, timestamp_format: :datetime, time_format: :time, date_format: :date, tracing: false, page_size: 10000, uuid_format: :binary, timeuuid_format: :binary, retry_strategy: Worker.CounterRetryStrategy, retry_count: 3])
** (EXIT) shutdown
Pid: #PID<0.2490.0>
Start Call: Worker.start_link([db_conn: #PID<0.2135.0>, serial: 100025, cluster_map: [...]])
2023-04-19 09:21:13.716 module=supervisor pid=<0.2476.0> [error]: Child Worker of Supervisor #PID<0.2476.0> (Supervisor.Server) terminated
** (exit) exited in: DBConnection.Holder.checkout(#PID<0.2913.0>, [telemetry_metadata: %{}, decimal_format: :tuple, timestamp_format: :datetime, time_format: :time, date_format: :date, tracing: false, page_size: 10000, uuid_format: :binary, timeuuid_format: :binary, retry_strategy: Worker.CounterRetryStrategy, retry_count: 3])
** (EXIT) shutdown
Pid: #PID<0.2478.0>
Start Call: Worker.start_link([db_conn: #PID<0.2135.0>, serial: 100021, cluster_map: [...]])
2023-04-19 09:21:13.716 module=supervisor pid=<0.2491.0> [error]: Child Worker of Supervisor #PID<0.2491.0> (Supervisor.Server) terminated
** (exit) exited in: DBConnection.Holder.checkout(#PID<0.2913.0>, [telemetry_metadata: %{}, decimal_format: :tuple, timestamp_format: :datetime, time_format: :time, date_format: :date, tracing: false, page_size: 10000, uuid_format: :binary, timeuuid_format: :binary, retry_strategy: Worker.CounterRetryStrategy, retry_count: 3])
** (EXIT) shutdown
Pid: #PID<0.2493.0>
... [many more identical logs]
This:
** (exit) exited in: DBConnection.Holder.checkout(#PID<0.2156.0>, [telemetry_metadata: %{}, decimal_format: :tuple, timestamp_format: :datetime, time_format: :time, date_format: :date, tracing: false, page_size: 10000, uuid_format: :binary, timeuuid_format: :binary, retry_strategy: Worker.CounterRetryStrategy, retry_count: 3])
** (EXIT) shutdown
is quite annoying, yes. I'm not even sure where it comes from though. I think #PID<0.2156.0> is the pool, but I'm not sure why it exits with reason shutdown. Any ideas?
Yes, I have confirmed that that PID is indeed the pool, but unfortunately, I don't have any idea why DBConnection.Holder.checkout/2 fails with an exit either.
However, we might want to consider using catch instead of rescue, to handle EXITs as well?
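To make the suggestion concrete, here is a self-contained illustration of the difference: rescue only matches exceptions, while catch :exit also traps exit signals such as the shutdown coming out of DBConnection.Holder.checkout/2:

```elixir
result =
  try do
    # Simulates the pool shutting down underneath us.
    exit(:shutdown)
  rescue
    # Never matches here: an exit is not an exception.
    _exception -> :rescued
  catch
    # This clause is the one that fires for process exits.
    :exit, reason -> {:caught_exit, reason}
  end

IO.inspect(result) # => {:caught_exit, :shutdown}
```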
@harunzengin I’m a bit wary of using catch blindly, because I don’t want to accidentally catch stuff we’re not supposed to. 🤔
Tracking it down, it looks like there are some conditions in which the holder exits here, and also, after it sends this message to the pool here: {:db_connection, {self(), lock}, {:checkout, callers, start, queue?}}
it can exit with whatever reason the pool's :DOWN message carries here. But I didn't figure out how exactly it shuts down.
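To illustrate the mechanism being described (a toy example, not the actual DBConnection internals): a process that monitors another receives a :DOWN message carrying the monitored process's exit reason, and can then re-exit with that same reason, which would explain the holder propagating the pool's shutdown:

```elixir
# spawn_monitor/1 spawns and monitors atomically, so we reliably
# receive the real exit reason even if the process dies immediately.
{pool, ref} = spawn_monitor(fn -> exit(:shutdown) end)

receive do
  {:DOWN, ^ref, :process, ^pool, reason} ->
    IO.inspect({:pool_went_down_with, reason}) # => {:pool_went_down_with, :shutdown}
end
```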
This should be fixed.