absinthe-graphql/absinthe_phoenix

Pubsub not configured Error

maennchen opened this issue ยท 30 comments

Sentry catches a lot of RuntimeErrors saying Pubsub not configured! Subscriptions require a configured pubsub module..

I could not find out in which cases it is failing, i just see around 600 errors over a timespan of 30 days. I've never experienced the error myself and it only occurs in a low percentage of all the subscriptions.

Full Exception

Elixir.RuntimeError: Pubsub not configured!

Subscriptions require a configured pubsub module.

  File "lib/absinthe/phase/subscription/subscribe_self.ex", line 105, in Absinthe.Phase.Subscription.SubscribeSelf.ensure_pubsub!/1
  File "lib/absinthe/phase/subscription/subscribe_self.ex", line 20, in Absinthe.Phase.Subscription.SubscribeSelf.do_subscription/3
  File "lib/absinthe/pipeline.ex", line 274, in Absinthe.Pipeline.run_phase/3
  File "lib/absinthe/phoenix/channel.ex", line 98, in Absinthe.Phoenix.Channel.run/4
  File "lib/absinthe/phoenix/channel.ex", line 59, in Absinthe.Phoenix.Channel.handle_in/3
  File "lib/phoenix/channel/server.ex", line 244, in anonymous fn/4 in Phoenix.Channel.Server.handle_info/2
  File "lib/acme_api/endpoint.ex", line 1, in AcmeApi.Endpoint.instrument/4
  File "gen_server.erl", line 616, in :gen_server.try_dispatch/4
  File "gen_server.erl", line 686, in :gen_server.handle_msg/6
  File "proc_lib.erl", line 247, in :proc_lib.init_p_do_apply/3
  Module "Elixir.Phoenix.Channel.Server", in Phoenix.Channel.Server.init/1

Versions

absinthe_phoenix: 1.4.3
absinthe: 1.4.13
phoenix: 1.3.4

Setup

*Absinthe.Subscription added to Application

  • Use of Absinthe.Phoenix.Socket in Socket Implementation
  • Router Forward to Absinthe.Plug (via ApolloTracing.Pipeline)

I am having the same issue.

This may be related to phoenixframework/phoenix#3179

Chris' reply to this is helpful but not the full story, since Absinthe.Phoenix right now is written to expect you to do use Absinthe.Phoenix.Endpoint in your endpoint module. The code there uses the Endpoint to lookup various pieces of useful information. If we're going to decouple this from the Endpoint we'll need a new module.

In the meantime I highly suggest implementing "readiness" checks alongside your heath checks, where you don't route traffic to your node until it's fully booted.

@benwilson512 are there any recent recommendations on this?

Since I upgraded to Phoenix 1.5, I see the many messages like this in my logs when my server shuts down:

GenServer #PID<0.4431.0> terminating
** (RuntimeError) Pubsub not configured!

Subscriptions require a configured pubsub module.

    (absinthe 1.5.1) lib/absinthe/phase/subscription/subscribe_self.ex:109: Absinthe.Phase.Subscription.SubscribeSelf.ensure_pubsub!/1
    (absinthe 1.5.1) lib/absinthe/phase/subscription/subscribe_self.ex:20: Absinthe.Phase.Subscription.SubscribeSelf.do_subscription/3
    (absinthe 1.5.1) lib/absinthe/pipeline.ex:368: Absinthe.Pipeline.run_phase/3
    (absinthe_phoenix 2.0.0) lib/absinthe/phoenix/channel.ex:123: Absinthe.Phoenix.Channel.run/4
    (absinthe_phoenix 2.0.0) lib/absinthe/phoenix/channel.ex:91: Absinthe.Phoenix.Channel.run_doc/4
    (absinthe_phoenix 2.0.0) lib/absinthe/phoenix/channel.ex:57: Absinthe.Phoenix.Channel.handle_in/3
    (phoenix 1.5.3) lib/phoenix/channel/server.ex:316: Phoenix.Channel.Server.handle_info/2
    (stdlib 3.12.1) gen_server.erl:637: :gen_server.try_dispatch/4

The error seems to happen for my already-connected clients. Not new clients connecting while the app is being shut down.

CleanShot 2020-06-19 at 16 38 29@2x

The client is connected to the socket, and the socket itself is fine. The highlighted line seems to be when Absinthe.Subscription is shut down. The client, still connected to the same socket, then tries to rejoin over and over again, but Absinthe.Subscription isn't running so it gets rejected every time.

Only when Phoenix.Endpoint closes the WebSocket connection will the client try to reconnect to a new socket and there they get to a healthy server.

Any guidance would be much appreciated. Thanks for Absinthe! :)

@sebastianseilund What order do you have things in your supervision tree? Have you considered using something like https://github.com/derekkraan/ranch_connection_drainer ? Ideally you don't keep routing traffic to a node that is shutting down.

Order:

        {Phoenix.PubSub, pubsub_config()},
        UserflowWeb.Endpoint,
        {Absinthe.Subscription, UserflowWeb.Endpoint},

If I put Absinthe.Subscription before UserflowWeb.Endpoint it won't start.

I'm not routing any new traffic to my nodes once shutdown starts. As described above, the error happens in the sockets that are already connected (via WebSocket). This is the order of things:

  1. Client is connected to node via WebSocket and has joined an Absinthe channel. All is well.
  2. Kubernetes decides to delete the pod.
  3. Kubernetes removes the pod from all Services (should not route any new traffic to it)
  4. Kubernetes sends SIGTERM to the application. (I know this happens after (3) due to a preStop sleep 5 command)
  5. Absinthe.Subscription is shut down.
  6. That (immediately?) causes the above error in the channel that's already running (since Endpoint has not been shut down yet)
  7. Client is still connected to the socket and rejoins the same channel, but it fails again due to Pubsub not configured!
  8. (7) happens several times until Endpoint is finally shut down and the client is disconnected.
  9. The client reconnects via a fresh WebSocket connection and gets to a healthy server it can join the channel on. All is well again.

I haven't looked at https://github.com/derekkraan/ranch_connection_drainer before. Would you put that after Absinthe.Subscription?

We are seeing the exact same error that @sebastianseilund posted above, for very similar reasons.

@benwilson512 I looked into https://github.com/derekkraan/ranch_connection_drainer. It waits for connections to end before letting Endpoint shut itself down. So I don't think it'll fix the issue here since the WebSocket connections stay open.

Fair. I think the best option here would be to figure out a way to have the supervision tree look like this:

[
Pubsub,
Absinthe.Subscription,
Endpoint
]

Then the endpoint would shutdown before the pubsub. Let me see what I can do.

I think that would be amazing ๐Ÿ™

@benwilson512 Is there any movement on this? I am seeing ~150 of these errors daily on my production.

@lstrzebinczyk the best solution here is to use a load balancer that routes traffic away from your node on deploy. Even if I split out the pubsub from the endpoint, it will simply result in different issues for your clients. If you aren't routing traffic away from your node on deploy then when they are disconnected they may try to reconnect to the same node which will now fail.

I will try to find time to make the proposed improvements here this week, but it is merely going to help with error messages, it isn't going to solve the root issue. The root issue is that you are not disconnecting traffic properly from a node that is going down.

If you aren't routing traffic away from your node on deploy then when they are disconnected they may try to reconnect to the same node which will now fail.

This doesn't happen. Kubernetes does route new traffic away from pods before shutting down the Elixir container. The existing network connections are not terminated by anything in Kubernetes though.

When the client reconnects, they'll be sent to one of the new pods (or one of the old ones that haven't begun shutdown yet).

The root issue is that you are not disconnecting traffic properly from a node that is going down.

I don't think you'd want to terminate active connections outside of Elixir's control. If you have a regular slow HTTP request, you don't want it to be terminated half-way. That could leave stuff in a bad state, or at least ends up in an unnecessary 5xx error to the client. Just let it drain.

When MyApp.Endpoint is shut down, it will gracefully terminate the WebSocket connections. Only once this has happened, does it seem safe to shut down Absinthe.Subscription.

That's fair, there are definitely shutdown patterns that make sense for Elixir to lead. OK, I'll prioritize working on that this week.

any update on this?

Any updates?

I'm encountering this as well. I'm curious if there is an update on the horizon similar to what @benwilson512 seemed to have in mind in #46 (comment).

Experienced this today, and fixed it by adding a supervisor to my application.ex (might have missed it when setting up). Hope this helps.

def start(_type, _args) do 
    import Supervisor.Spec

    children = [
      AppWeb.Telemetry,
      {Phoenix.PubSub, name: App.PubSub},
      AppWeb.Endpoint,
      supervisor(Absinthe.Subscription, [AppWeb.Endpoint]) -> Added this
    ]
end

@alvnrapada's fix didn't seem to work on my application when I was testing by doing a rolling restart of a deployment in k8s... so I made this experimental "hack PR" #89 outlined by #46 (comment). Prior to making the change to my kubernetes deployed app indicated by this PR, I too was receiving the "pubsub not configured" error whenever I was triggering a rolling restart in kubernetes as a test.

I tried adding tests to the PR but Phoenix.ChannelTest helper pathways seemed to expect an ets backed endpoint configuration, and I wasn't quite sure the best way of tackling this at the moment (open to suggestions).

Testing notes:

  1. I tested the below on a demo app (nothing with any kind of traffic other than me trying to manually trigger subscriptions)
  2. I manually tested a simple query which triggered a subscription and a simple edit mutation which triggered another subscription using the /graphiql GUI in browser.
  3. Doing the rolling restart of a kubernetes deployement with the below no longer seemed to trigger the pubsub not configured error (at least as much as I could verify by live tailing the logs of a pod).

I will try and put together a demo application esp. if people think it might help in any way with testing (I'd also be curious as to results others might get), but here is the outline of what I did:

application.ex

  def start(_type, _args) do

    children = [
      # Start the Telemetry supervisor
      ExampleWeb.Telemetry,
      # Start the PubSub system
      # Start the Endpoint (http/https)
      {Phoenix.PubSub, name: Example.PubSub},
      Example.Presence,
      {Absinthe.Subscription, ExampleWeb.Pubsub.Endpoint},
      ExampleWeb.Endpoint
    ]
...# continued

user_socket.ex

defmodule ExampleWeb.UserSocket do
  use Phoenix.Socket
  use Absinthe.Phoenix.Socket, schema: ExampleWeb.Schema, endpoint: ExampleWeb.Pubsub.Endpoint

  ## Channels
  # channel "room:*", ExampleWeb.RoomChannel

  # Socket params are passed from the client and can
  # be used to verify and authenticate a user. After
  # verification, you can put default assigns into
  # the socket that will be set for all channels, ie
  #
  #     {:ok, assign(socket, :user_id, verified_user_id)}
  #
  # To deny connection, return `:error`.
  #
  # See `Phoenix.Token` documentation for examples in
  # performing token verification on connect.
  @impl true
  def connect(_params, socket, _connect_info) do
    {:ok, socket}
  end

  # Socket id's are topics that allow you to identify all sockets for a given user:
  #
  #     def id(socket), do: "user_socket:#{socket.assigns.user_id}"
  #
  # Would allow you to broadcast a "disconnect" event and terminate
  # all active sockets and channels for a given user:
  #
  #     ExampleWeb.Endpoint.broadcast("user_socket:#{user.id}", "disconnect", %{})
  #
  # Returning `nil` makes this socket anonymous.
  @impl true
  def id(_socket), do: nil
# omitted for brevity...
end

endpoint.ex

defmodule ExampleWeb.Endpoint do
  use Phoenix.Endpoint, otp_app: :example
  # use Absinthe.Phoenix.Endpoint

  # The session will be stored in the cookie and signed,
  # this means its contents can be read but not tampered with.
  # Set :encryption_salt if you would also like to encrypt it.
  @session_options [
    store: :cookie,
    key: "_Example_key",
    signing_salt: "sgWFmDOb"
  ]

  socket("/socket", ExampleWeb.UserSocket,
    websocket: [check_origin: false],
    longpoll: false
  )

  socket("/live", Phoenix.LiveView.Socket, websocket: [connect_info: [session: @session_options]])

#...omitted for brevity
end

new decoupled pubsub implementation file

defmodule ExampleWeb.Pubsub.Endpoint do
  use Absinthe.Phoenix.Subscription.Pubsub, otp_app: :example
end

whatever_file_where_a_subscription_is_published.ex

    :ok =
      Absinthe.Subscription.publish(ExampleWeb.Pubsub.Endpoint, %{users: user_list},
        new_users: "new_users"
      )

config.exs

config :example, ExampleWeb.Pubsub.Endpoint, pubsub_server: Example.PubSub

In terms of my k8 deployment.yaml config, it might also be worth noting I had a readiness probe and a terminationGracePeriod seconds of 65 like this (abbreviated yaml):

      terminationGracePeriodSeconds: 65
      containers:
        - name: example-web
          # the same name and tag must be used when building the image
          image: gcr.io/project_id_2200/example_web:prod
          imagePullPolicy: Always
          ports:
            - containerPort: 80
          readinessProbe:
            httpGet:
              path: /lbcheck
              port: 5000
            initialDelaySeconds: 60

we've just started seeing this issue in the last week or so, wondering if we can help at all?

Has anyone been able to reproduce this issue in a locally running application? That might simplify testing any proposed fixes here.

It happens in our dev enviroment every time it restarts while running under watchexec, so I can test proposed solutions quite easily. @alvnrapada's fix did not work for me.

bfad commented

So here's what I'm trying to do to get the supervision tree to end with the endpoint:

The supervision tree setup:

    children = [
      {Phoenix.PubSub, [name: MyappWeb.PubSub, adapter: Phoenix.PubSub.PG2]},
      {Absinthe.Subscription, MyappWeb.Endpoint},
      MyappWeb.Endpoint,
    ]

The Endpoint for my app:

defmodule MyappWeb.Endpoint do
  # Important that this goes before `use Phoenix.Endpoint`
  use MyappWeb.OverridePhoenixEndpoint, otp_app: :myapp
  use Phoenix.Endpoint, otp_app: :myapp
  use Absinthe.Phoenix.Endpoint
# ...
end

The overriding for Phoenix.Endpoint.pubsub_server!/0

defmodule MyappWeb.OverridePhoenixEndpoint do
  defmacro __using__([otp_app: app_name]) do
    quote do
      defp pubsub_server! do
        server = if :ets.whereis(__MODULE__) != :undefined do
          config(:pubsub_server)
        else
          Application.get_env(unquote(app_name), __MODULE__) |> Keyword.get(:pubsub_server)
        end

        server || raise ArgumentError, "no :pubsub_server configured for #{inspect(__MODULE__)}"
      end
    end
  end
end

This seems to me to be working, but also seems like it's not the greatest idea in the world. Does anyone have any better suggestions or reasons why this is dangerous?

I'm seeing this error as well in my local environment. I'm attempting to capture process exit's to gracefully shutdown. When I issue a System.stop() in IEX to initiate a shutdown, I see a number of Absinthe errors. That may be a way to help reproduce it locally.

I was also able to reproduce sending a sigterm signal to the process running mix phx.server locally. My steps were:

  1. Start phoenix mix phx.server
  2. Subscribe to any graphql subscription
  3. Kill the process, pkill -f "mix phx.server" (MacOS)

For reference, sigterm is the signal sent by AWS ECS system to shutdown task containers. "Today, ECS always sends a SIGTERM" (https://aws.amazon.com/blogs/containers/graceful-shutdowns-with-ecs/)

hoyon commented

I was also able to repro in our codebase by sending SIGTERM to the process while there's an active subscription.

Any news here? It seems that having Subscription and Endpoint causes race conditions and your Supervision tree ends up being unpredictable

This issue still occurs, though I only get one error every few days running in prod. Anyone found a "neat" fix?

there is (a bit dirty) fix that we use now in production: prosapient/absinthe@39e5a92

note: it handles SIGTERM (perfect for k8s) - just to understand when pubsub is not configured and when there is a progress with app shutdown.

@benwilson512 Any movement on this issue? We have been implementing more subscription-based functionality in our app and as a consequence we're seeing this PubSub failure a lot more now when users are connected to the socket during a restart. @bfad 's solution technically does work - but we get a lot of warning logs like

warning: this clause for pubsub_server!/0 cannot match because a previous clause at line 2 always matches
  lib/my_app_web/endpoint.ex:3

With that solution, killing the app (with :init.stop() in IEx) results in a log message like

18:48:34.553 [info] Shutting down 1 sockets in 1 rounds of 2000ms

before closing gracefully.