elixir-ecto/postgrex

ReplicationConnection: arg to debug errors in callback

DaemonSnake opened this issue · 9 comments

Elixir version

1.15.7

Database and Version

12

Postgrex Version

0.17.5

Current behavior

when an exception occurs in one of the user-defined callback we lost most of the information that could help in solving the issue.
For instance this issue for walex (uses postgrex's Replication connection): cpursley/walex#61

2024-05-03 14:10:57.440 [notice] Application my_app exited: MyApp.Application.start(:normal, []) returned an error: shutdown: failed to start child: WalEx.Supervisor
    ** (EXIT) shutdown: failed to start child: WalEx.Replication.Supervisor
        ** (EXIT) shutdown: failed to start child: WalEx.Replication.Server
            ** (EXIT) :function_clause

The only way to debug this king of errors (a hanlde_result doesn't match the current params, some runtime errors occurs in the callback, etc...) have been to do something like so:

 lib/postgrex/replication_connection.ex | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/lib/postgrex/replication_connection.ex b/lib/postgrex/replication_connection.ex
index 8fcde20..faccaf3 100644
--- a/lib/postgrex/replication_connection.ex
+++ b/lib/postgrex/replication_connection.ex
@@ -546,8 +546,18 @@ defmodule Postgrex.ReplicationConnection do
     end
   end

+  defp sapply(mod, fun, args) do
+    try do
+      apply(mod, fun, args)
+    rescue
+      error ->
+        Logger.error("REPLICATION_CONNECTION:" <> Exception.format(:error, error, __STACKTRACE__))
+        raise error
+    end
+  end
+
   defp handle(mod, fun, args, from, %{streaming: streaming} = s) do
-    case apply(mod, fun, args) do
+    case sapply(mod, fun, args) do
       {:noreply, mod_state} ->
         {:keep_state, %{s | state: {mod, mod_state}}}

Expected behavior

having a more useful amount of information when the ReplicationConnection crashes when invoking user-defined callbacks.
This so that writing and maintaining a ReplicationConnection would be a more clear experience and a streamlined debuging experience

Something somewhere is not annotating the errors with more information, I'd need to dig deeper. Can you provide a small snippet that reproduces the error?

oh indeed, I tried to replicate by creating a new project using only postgrex (with the desired handle_result missing) and got the following

** (Mix) Could not start application test_app: Test.RootSupervisor.start(:normal, []) returned an error: shutdown: failed to start child: Test.Supervisor
    ** (EXIT) shutdown: failed to start child: Conn
        ** (EXIT) %FunctionClauseError{module: Conn, function: :handle_result, arity: 2, kind: nil, args: nil, clauses: nil}

So it looks like it's Walex specific.
Thanks for the assistance

hello
In the end I managed to reproduce with the minimum amount of code possible
I created this repo to host the test project: https://github.com/DaemonSnake/replication_slot_test_app/tree/main
I tried to remove as many variables as possible and only rely on postgrex 1.75.5 with minimal amount of configuration.
Am still getting this low-detail errors:

** (Mix) Could not start application test_app: exited in: TestApp.Sup.start(:normal, [])
    ** (EXIT) an exception was raised:
        ** (MatchError) no match of right hand side value: {:error, {:shutdown, {:failed_to_start_child, Conn, :function_clause}}}
            (test_app 0.1.0) lib/test_app.ex:11: TestApp.Sup.start/2
            (kernel 9.1) application_master.erl:293: :application_master.start_it_old/4

That's cause by your own app logic. On this line, you assume it returns ok, but it may not: https://github.com/DaemonSnake/replication_slot_test_app/blob/main/lib/test_app/sup.ex#L11

I see, if I change this to:

    case Supervisor.start_link(children, strategy: :one_for_one) do
      {:ok, pid} ->
        Logger.info("Supervisor started")
        {:ok, pid}

      {:error, reason} ->
        Logger.error("Supervisor failed to start: #{inspect(reason)}")
        ...
    end

I still get very low amount of information:

{:shutdown, {:failed_to_start_child, TestApp.Conn, :function_clause}}

You are right. For some reason the failure that happens is not including the stacktrace. Can you please try this running iex -S mix and then run this:

iex(2)> Postgrex.ReplicationConnection.start_link(:foo, [], [])

What do you get printed? I get this:


12:15:46.144 [error] :gen_statem #PID<0.1088.0> terminating
** (UndefinedFunctionError) function :foo.init/1 is undefined (module :foo is not available)
    :foo.init([])
    (postgrex 0.17.5) lib/postgrex/replication_connection.ex:452: Postgrex.ReplicationConnection.init/1
    (stdlib 5.0.2) gen_statem.erl:966: :gen_statem.init_it/6
    (stdlib 5.0.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
Queue: []
Postponed: []
State: :undefined
Data: :undefined
Callback mode: :state_functions, state_enter: false

very odd:

Erlang/OTP 26 [erts-14.1.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]

Interactive Elixir (1.15.7) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> Postgrex.ReplicationConnection.start_link(:foo, [], [])
{:error, :undef}

It is weird that you don't get any report, the linked process should at least crash, but the good news is that we have added reports on Elixir v1.17 (the next version, the one I am using) and this problem will be solved once it is out. :) So I think we are good!

Super, thank you very much 🙌
Closing