whatyouhide/xandra

EXITs not handled when Cassandra node goes down

Closed this issue · 5 comments

Currently, when a Cassandra node goes down, we get a burst of errors, among them some process exits. We see these errors under moderate load (~200 queries/second) on a 3-node cluster with a pool_size of 10: while the pooled connections are being used to execute queries, the calls fail with EXITs that aren't caught.
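
Roughly, our setup looks something like this (a sketch only; node addresses and the query are placeholders, while pool_size, the retry strategy module, and retry_count match the logs below):

    # Cluster with 3 contact points and 10 connections per node (addresses are placeholders).
    {:ok, cluster} =
      Xandra.Cluster.start_link(
        nodes: ["10.0.0.1:9042", "10.0.0.2:9042", "10.0.0.3:9042"],
        pool_size: 10
      )

    # Workers execute queries with a custom retry strategy; when a node goes down,
    # some of these calls crash with an uncaught EXIT instead of returning {:error, _}.
    Xandra.Cluster.execute(cluster, "SELECT * FROM my_keyspace.my_table", [],
      retry_strategy: Worker.CounterRetryStrategy,
      retry_count: 3
    )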

Stacktrace
2023-04-19 09:17:48.731 module=Worker.CounterRetryStrategy pid=<0.2226.0> [warning]: Retrying. retries_left: 3 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.732 module=Worker.CounterRetryStrategy pid=<0.2226.0> [warning]: Retrying. retries_left: 2 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.732 module=Worker.CounterRetryStrategy pid=<0.2226.0> [warning]: Retrying. retries_left: 1 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.732 module=DBConnection.Connection pid=<0.2162.0> [error]: Xandra.Connection (#PID<0.2162.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.732 module=DBConnection.Connection pid=<0.2160.0> [error]: Xandra.Connection (#PID<0.2160.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.733 module=Worker.CounterRetryStrategy pid=<0.2226.0> [warning]: stopped retrying, error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.733 module=Worker.CounterRetryStrategy pid=<0.2226.0> [warning]: Retrying. retries_left: 3 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.733 module=DBConnection.Connection pid=<0.2163.0> [error]: Xandra.Connection (#PID<0.2163.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.734 module=DBConnection.Connection pid=<0.2158.0> [error]: Xandra.Connection (#PID<0.2158.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.735 module=DBConnection.Connection pid=<0.2163.0> [error]: Xandra.Connection (#PID<0.2163.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.735 module=DBConnection.Connection pid=<0.2160.0> [error]: Xandra.Connection (#PID<0.2160.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.735 module=DBConnection.Connection pid=<0.2162.0> [error]: Xandra.Connection (#PID<0.2162.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.736 module=DBConnection.Connection pid=<0.2158.0> [error]: Xandra.Connection (#PID<0.2158.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.765 module=Worker.CounterRetryStrategy pid=<0.2232.0> [warning]: Retrying. retries_left: 3 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.765 module=DBConnection.Connection pid=<0.2164.0> [error]: Xandra.Connection (#PID<0.2164.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.766 module=Worker.CounterRetryStrategy pid=<0.2232.0> [warning]: Retrying. retries_left: 2 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.766 module=DBConnection.Connection pid=<0.2164.0> [error]: Xandra.Connection (#PID<0.2164.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.766 module=DBConnection.Connection pid=<0.2167.0> [error]: Xandra.Connection (#PID<0.2167.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.767 module=Worker.CounterRetryStrategy pid=<0.2232.0> [warning]: Retrying. retries_left: 1 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.767 module=DBConnection.Connection pid=<0.2165.0> [error]: Xandra.Connection (#PID<0.2165.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.767 module=Worker.CounterRetryStrategy pid=<0.2232.0> [warning]: stopped retrying, error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.767 module=Worker.CounterRetryStrategy pid=<0.2232.0> [warning]: Retrying. retries_left: 3 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.767 module=DBConnection.Connection pid=<0.2167.0> [error]: Xandra.Connection (#PID<0.2167.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.767 module=DBConnection.Connection pid=<0.2166.0> [error]: Xandra.Connection (#PID<0.2166.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.768 module=DBConnection.Connection pid=<0.2165.0> [error]: Xandra.Connection (#PID<0.2165.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.768 module=DBConnection.Connection pid=<0.2166.0> [error]: Xandra.Connection (#PID<0.2166.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.872 module=Worker.CounterRetryStrategy pid=<0.2238.0> [warning]: Retrying. retries_left: 3 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.872 module=DBConnection.Connection pid=<0.2161.0> [error]: Xandra.Connection (#PID<0.2161.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.872 module=Worker.CounterRetryStrategy pid=<0.2238.0> [warning]: Retrying. retries_left: 2 error: %Xandra.ConnectionError{action: "execute", reason: :closed}
2023-04-19 09:17:48.872 module=DBConnection.Connection pid=<0.2159.0> [error]: Xandra.Connection (#PID<0.2159.0>) disconnected: ** (Xandra.ConnectionError) action "execute" failed with reason: socket is closed
2023-04-19 09:17:48.873 module=supervisor pid=<0.2236.0> [error]: Child Worker of Supervisor #PID<0.2236.0> (Supervisor.Server) terminated
** (exit) exited in: DBConnection.Holder.checkout(#PID<0.2156.0>, [telemetry_metadata: %{}, decimal_format: :tuple, timestamp_format: :datetime, time_format: :time, date_format: :date, tracing: false, page_size: 10000, uuid_format: :binary, timeuuid_format: :binary, retry_strategy: Worker.CounterRetryStrategy, retry_count: 3])
    ** (EXIT) shutdown
Pid: #PID<0.2238.0>
Start Call: Worker.start_link([db_conn: #PID<0.2135.0>, serial: 100093, cluster_map: [...]])
2023-04-19 09:17:48.873 module=DBConnection.Connection pid=<0.2159.0> [error]: Xandra.Connection (#PID<0.2159.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused
2023-04-19 09:17:48.873 module=DBConnection.Connection pid=<0.2161.0> [error]: Xandra.Connection (#PID<0.2161.0>) failed to connect: ** (Xandra.ConnectionError) action "TCP connect" failed with reason: connection refused

The process exits then repeat after every periodic topology refresh for as long as the node is down, which is annoying and causes log spam.

Stacktrace
2023-04-19 09:21:13.716 module=supervisor pid=<0.2488.0> [error]: Child Worker of Supervisor #PID<0.2488.0> (Supervisor.Server) terminated
** (exit) exited in: DBConnection.Holder.checkout(#PID<0.2913.0>, [telemetry_metadata: %{}, decimal_format: :tuple, timestamp_format: :datetime, time_format: :time, date_format: :date, tracing: false, page_size: 10000, uuid_format: :binary, timeuuid_format: :binary, retry_strategy: Worker.CounterRetryStrategy, retry_count: 3])
    ** (EXIT) shutdown
Pid: #PID<0.2490.0>
Start Call: Worker.start_link([db_conn: #PID<0.2135.0>, serial: 100025, cluster_map: [...]])
2023-04-19 09:21:13.716 module=supervisor pid=<0.2476.0> [error]: Child Worker of Supervisor #PID<0.2476.0> (Supervisor.Server) terminated
** (exit) exited in: DBConnection.Holder.checkout(#PID<0.2913.0>, [telemetry_metadata: %{}, decimal_format: :tuple, timestamp_format: :datetime, time_format: :time, date_format: :date, tracing: false, page_size: 10000, uuid_format: :binary, timeuuid_format: :binary, retry_strategy: Worker.CounterRetryStrategy, retry_count: 3])
    ** (EXIT) shutdown
Pid: #PID<0.2478.0>
Start Call: Worker.start_link([db_conn: #PID<0.2135.0>, serial: 100021, cluster_map: [...]])
2023-04-19 09:21:13.716 module=supervisor pid=<0.2491.0> [error]: Child Worker of Supervisor #PID<0.2491.0> (Supervisor.Server) terminated
** (exit) exited in: DBConnection.Holder.checkout(#PID<0.2913.0>, [telemetry_metadata: %{}, decimal_format: :tuple, timestamp_format: :datetime, time_format: :time, date_format: :date, tracing: false, page_size: 10000, uuid_format: :binary, timeuuid_format: :binary, retry_strategy: Worker.CounterRetryStrategy, retry_count: 3])
    ** (EXIT) shutdown
Pid: #PID<0.2493.0>
... [many more identical logs]

See: #273 #183

This:

** (exit) exited in: DBConnection.Holder.checkout(#PID<0.2156.0>, [telemetry_metadata: %{}, decimal_format: :tuple, timestamp_format: :datetime, time_format: :time, date_format: :date, tracing: false, page_size: 10000, uuid_format: :binary, timeuuid_format: :binary, retry_strategy: Worker.CounterRetryStrategy, retry_count: 3])
    ** (EXIT) shutdown

is quite annoying, yes. I'm not even sure where it comes from, though. I think #PID<0.2156.0> is the pool, but I'm not sure why it exits with reason shutdown. Any ideas?

Yes, I've confirmed that that PID is indeed the pool, but unfortunately I don't have any idea either why DBConnection.Holder.checkout/2 fails with an exit.

However, should we consider using catch instead of rescue, so that EXITs are handled as well?
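
For illustration, this is the difference I mean (a minimal standalone sketch, not Xandra's actual code):

    defmodule ExitHandlingSketch do
      # `rescue` only matches exceptions; a process exit such as exit(:shutdown)
      # coming out of DBConnection.Holder.checkout/2 is not an exception, so it
      # propagates and crashes the caller.
      def with_rescue(fun) do
        try do
          fun.()
        rescue
          exception -> {:error, exception}
        end
      end

      # Adding `catch :exit, reason` also traps exits and turns them into a value.
      def with_catch(fun) do
        try do
          fun.()
        rescue
          exception -> {:error, exception}
        catch
          :exit, reason -> {:error, {:exit, reason}}
        end
      end
    end

    # ExitHandlingSketch.with_rescue(fn -> exit(:shutdown) end) crashes the caller;
    # ExitHandlingSketch.with_catch(fn -> exit(:shutdown) end) returns {:error, {:exit, :shutdown}}.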

@harunzengin I'm a bit wary of adding catches blindly, because I don't want us to accidentally catch things we're not supposed to. 🤔

Tracking it down, it looks like there are some conditions in which the holder exits here, and also, after it sends this message to the pool here: {:db_connection, {self(), lock}, {:checkout, callers, start, queue?}}, it can exit with whatever reason it receives in the :DOWN message when the pool goes down here. But I couldn't figure out how exactly the pool shuts down.
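
Roughly, the path I mean looks like this (a simplified sketch, not DBConnection's actual source; only the message tuple above is taken from the code):

    defmodule HolderCheckoutSketch do
      # The holder monitors the pool, sends it the checkout message, and if the
      # pool goes down before replying, exits with the pool's exit reason, which
      # is how a shutdown of the pool becomes the caller's EXIT.
      def checkout(pool, lock, callers, queue?) do
        ref = Process.monitor(pool)
        start = System.monotonic_time()
        send(pool, {:db_connection, {self(), lock}, {:checkout, callers, start, queue?}})

        receive do
          {:DOWN, ^ref, :process, ^pool, reason} -> exit(reason)
          reply -> reply  # the pool's reply, simplified
        end
      end
    end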

This should be fixed.