codedge-llc/kadabra

MatchError: {:error, {:already_started, #PID<0.18210.10>}}

Closed this issue · 33 comments

We have Kadabra running as our HTTP/2 client for pigeon (as is the default), and we get sporadic bursts of errors of this form. The raw stack trace is:

  Elixir.MatchError: no match of right hand side value: {:error, {:already_started, #PID<0.18210.10>}}
  File "lib/connection/flow_control.ex", line 146, in Kadabra.Connection.FlowControl.process/2
  File "lib/connection.ex", line 227, in Kadabra.Connection.do_send_headers/3
  File "lib/connection.ex", line 109, in Kadabra.Connection.handle_cast/2
  File "gen_server.erl", line 616, in :gen_server.try_dispatch/4
  File "gen_server.erl", line 686, in :gen_server.handle_msg/6
  File "proc_lib.erl", line 247, in :proc_lib.init_p_do_apply/3
  Module "Elixir.Kadabra.Connection", in Kadabra.Connection.init/1

My guess is there's some sort of race condition with connection creation, but I don't know enough about the library to say anything more informed. Any advice?

hpopp commented

Yeah, that would indicate new streams are being created with duplicate stream IDs. What version of kadabra are you running?

From our mixfile

{:pigeon, "~> 1.1.4"}, {:kadabra, "~> 0.3.6"}

hpopp commented

I guess it's worth asking too, what does your pigeon config look like?

  config :pigeon, :apns,
    apns_default: %{
      cert: "/some/creds/path",
      key: "/some/key/path",
      mode: :prod,
      use_2197: true
    }

We currently use pigeon only for APNS, as iOS is the only platform we support.

hpopp commented

Weird, that should be the ideal use case. What about your elixir and OTP versions?

elixir 1.5.2, OTP 20.1.7

hpopp commented

Still confused why it's happening. Haven't had any issues testing with those versions yet. How frequently does this occur and how many pushes are you normally sending at once?

We are a fairly high-volume application, but our iOS footprint is still relatively small. It looks like our move to pigeon actually disrupted our push metrics, so I'll try to pull them up to give you a better idea.

As for the error rate, there's a lot of variance: anywhere from 2 to 800 occurrences a day. There seems to be a compounding effect where you'll get hundreds in a burst, but I don't know the underlying flaw causing it.

As for concurrent push notifications, it's entirely possible we send hundreds simultaneously. We have a lot of fanout in our notifications.

hpopp commented

I just realized your config is wrong, though I'm not sure if it's related to the issue. use_2197: true has been renamed to port: 2197. I need to update the corresponding docs.
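
For reference, your config with the renamed option would be:

  config :pigeon, :apns,
    apns_default: %{
      cert: "/some/creds/path",
      key: "/some/key/path",
      mode: :prod,
      port: 2197
    }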

Oh interesting. Would that have any impact on this bug though?

hpopp commented

Maybe, if it's impacting the kadabra socket configuration. I'll try testing some more. Perhaps in the meantime try removing the port config and see if that resolves it.

The issue recurred after changing the port config.

Another thing I neglected to mention is our runtime. We run on Docker in an Alpine Linux image, using Distillery releases. Not sure if the OS could affect the stability of HTTP/2 in any way.

We are currently seeing the same thing. Just as @michaeljguarino said, we will get many of those errors in bursts.

We start our connections manually in a GenStage consumer:

  def init(:ok) do
    {:ok, apns_conn} = Pigeon.APNS.start_connection(cert: "priv/cert.pem", key: Application.get_env(:my_app, :apns)[:key], mode: :prod)
    {:ok, fcm_conn} = Pigeon.FCM.start_connection(key: Application.get_env(:my_app, :fcm)[:key])
    {:consumer, %{apns: apns_conn, fcm: fcm_conn}}
  end

Also seeing the following error somewhat frequently (not sure if related) --

GenServer {Registry.Kadabra, {#Reference<0.3248228533.436207621.227488>, :decoder}} terminating
** (MatchError) no match of right hand side value: :undefined
    (hpack) /app/deps/hpack/src/hpack.erl:206: :hpack.decode_literal_header_never_indexed/3
    (kadabra) lib/hpack.ex:42: Kadabra.Hpack.handle_call/3
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3

If relevant, we are also running our app via Docker in an Alpine image.

hpopp commented

That second error would indicate something really serious with the connection, as in, the socket is receiving data out of order. I'll have to bring up an alpine image to test further.

Yeah, we see the HPack errors too, but nowhere near the volume of the already_started connection errors.

Have you made any progress on this? Is there anything we can do to help out?

hpopp commented

I still have yet to recreate the bug for myself. If it's a race condition of some kind, I'm hoping it may inadvertently get resolved with some of the structure changes in v0.4.

@msmykowski confirmed the issue persists outside of Docker on Ubuntu.

I'd like to see if I can recreate the environment exactly. What OS are you running and where are you hosted?

We're running on ECS, using one of Amazon's "ECS optimized" AMIs. I'm not 100% positive, but the odds are it's running Amazon Linux.

We're also running entirely within an Amazon VPC, if that helps too.

We have the same problem, running kadabra with pigeon on Alpine, Erlang 20, and Elixir 1.5.0.

hpopp commented

Still haven't seen it crop up in production or any of my testing, but I'm still fairly certain it's being caused by duplicate stream IDs somewhere.

I've set up a PR (#29) that will log out state if it encounters the error. Give it a try and tell me what you get.

{:kadabra, github: "codedge-llc/kadabra", branch: "fix/already-started", override: true}

I'll try to swap the dependency by end of day, and will hand over any errors we see when they come up.

hpopp commented

It's something I've been considering. Push promises shouldn't ever be received in this context as they don't conceptually make sense for how push notifications are sent (same name, but totally different things). I haven't encountered them in the 2.5 years I've been developing pigeon.

The stream isn't incremented because push promises define their own stream identifier. If push promises are getting received on the connection, however, I'm going to make some significant architectural changes in pigeon.

Let me go ahead and update the PR to log out if any promises get received. It's possible APNS or FCM servers have recently changed.

@hpopp I created a branch to work backwards through the errors. In it, I create an arbitrary ref when encoding in the Hpack module.

This generates the first error I was seeing --

17:06:07.325 [error] GenServer {Registry.Kadabra, {#PID<0.574.0>, Kadabra.Connection}} terminating
** (stop) exited in: :gen_statem.call(#PID<0.591.0>, {:send_headers, [{":method", "POST"}, {":path", "/3/device/facce77551eba2d6b2755cbde270bf50abeaddd6896872e49d14d0dbfa57a2c1"}, {"content-length", "123"}, {"apns-topic", "com.bleacherreport.TeamStream"}], "{\"tag\":\"dodgeball\",\"aps\":{\"sound\":\"default\",\"alert\":{\"title\":null,\"body\":\"This is the alert body\"}},\"a\":\"analytics string\"}"}, :infinity)
    ** (EXIT) exited in: GenServer.call({:via, Registry, {Registry.Kadabra, {#Reference<0.0.4.916>, :encoder}}}, {:encode, [{":authority", "api.development.push.apple.com"}, {":method", "POST"}, {":path", "/3/device/facce77551eba2d6b2755cbde270bf50abeaddd6896872e49d14d0dbfa57a2c1"}, {":scheme", "https"}, {"apns-topic", "com.bleacherreport.TeamStream"}, {"content-length", "123"}]}, 5000)
        ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started

Interestingly enough, as the pushes continue, I then see the already_started error --

17:06:07.381 [error] GenServer {Registry.Kadabra, {#PID<0.584.0>, Kadabra.Connection}} terminating
** (MatchError) no match of right hand side value: {:error, {:already_started, #PID<0.598.0>}}
    (kadabra) lib/connection/flow_control.ex:147: Kadabra.Connection.FlowControl.process/2
    (kadabra) lib/connection.ex:227: Kadabra.Connection.do_send_headers/3
    (kadabra) lib/connection.ex:109: Kadabra.Connection.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:667: :gen_server.handle_msg/5
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3

What I seem to be seeing is that nothing ever returns from this call -- https://github.com/codedge-llc/kadabra/blob/master/lib/connection/flow_control.ex#L149 -- so the stream_id is never incremented. This causes the already_started error on the next push.

Incrementing the stream_id before making the call to the stream process mitigates the error, although I do not think that is the root of the problem.
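
For clarity, the mitigation I tried is shaped roughly like this -- an illustrative sketch only, not kadabra's actual FlowControl code, and the helper at the bottom is hypothetical:

  defmodule FlowControlSketch do
    # The point is the ordering: commit the new stream id to state *before*
    # dispatching to the stream process, so a call that never returns (or a
    # crash mid-call) can't leave a stale id to be reused on the next push.
    def process(%{stream_id: stream_id} = state, request) do
      next_state = %{state | stream_id: stream_id + 2}  # client ids increase by 2
      dispatch_to_stream(stream_id, request)
      next_state
    end

    # Hypothetical stand-in for starting the stream process and making the
    # blocking :gen_statem call that was observed to hang.
    defp dispatch_to_stream(stream_id, request) do
      send(self(), {:send_headers, stream_id, request})
      :ok
    end
  end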

One thought I had is that this error --

06:37:59.034 [error] ** State machine {Registry.Kadabra, {#Reference<0.2238806796.3053715460.107751>, 39}} terminating
** Last event = {:cast,
 {:recv,
  %Kadabra.Frame.Headers{end_headers: true, end_stream: false, exclusive: nil,
   header_block_fragment: <<140, 31, 47, 156, 192, 211, 173, 94, 134, 246, 21,
     156, 127, 13, 213, 155, 186, 219, 75, 60, 40, 64, 179, 176, 223, 135, 11,
     111, 57, 125, 225, 95>>, priority: false, stream_dependency: nil,
   stream_id: 39, weight: nil}}}
** When server state  = {:open,
 %Kadabra.Stream{body: "", buffer: "", connection: #PID<0.2125.0>,
  flow: %Kadabra.Stream.FlowControl{max_frame_size: 16384, queue: [],
   stream_id: 39, window: 65405}, headers: [], id: 39,
  ref: #Reference<0.2238806796.3053715460.107751>, scheme: :https,
  settings: nil,
  socket: {:sslsocket, {:gen_tcp, #Port<0.69649>, :tls_connection, :undefined},
   #PID<0.2129.0>}, uri: 'api.development.push.apple.com'}}
** Reason for termination = :exit:{{{:badmatch, :undefined},

may be crashing the hpack process, resulting in the other errors.

Here is the branch if you would like to test for yourself -- https://github.com/msmykowski/kadabra/tree/replicate-errors

@hpopp on further investigation, it looks like the stream supervisor is decoupled from the connection supervisor (one-for-one strategy); however, they share the same ref when started under the top-level supervisor -- https://github.com/codedge-llc/kadabra/blob/master/lib/supervisor.ex#L18.

The connection is coupled with the hpack processes using a one-for-all strategy -- https://github.com/codedge-llc/kadabra/blob/master/lib/connection_supervisor.ex#L29. This means that if the connection crashes for any reason (as we have been seeing with the hpack errors), it is restarted with the same ref, but its stream_id is reset to 1. The stream supervisor, however, is still holding on to the previously existing streams. That causes the already_started error, since a stream with stream_id=1 is already registered under that ref.
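
A minimal, self-contained sketch of that failure mode (module and registry names are illustrative, not kadabra's actual ones):

  defmodule StreamSketch do
    use GenServer

    def start_link({ref, stream_id}) do
      name = {:via, Registry, {StreamRegistry, {ref, stream_id}}}
      GenServer.start_link(__MODULE__, :ok, name: name)
    end

    @impl true
    def init(:ok), do: {:ok, %{}}
  end

  {:ok, _} = Registry.start_link(keys: :unique, name: StreamRegistry)
  ref = make_ref()

  # First connection lifetime: stream 1 is registered under {ref, 1}.
  {:ok, _pid} = StreamSketch.start_link({ref, 1})

  # The connection restarts with the same ref and resets its counter to 1,
  # but the old stream process is still alive and registered.
  StreamSketch.start_link({ref, 1})
  #=> {:error, {:already_started, pid}} -- the value that gets matched
  #   against {:ok, pid} in flow control, hence the MatchError above.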

hpopp commented

Thanks for digging into this -- seems like it would be wise to go ahead and make StreamSupervisor a child of ConnectionSupervisor. I'll put out a patch with the changes.

If either hpack table is ever corrupted, it's best to drop everything related to the connection and restart. Ideally the hpack encoding/decoding should never fail, so we need to figure out why it's happening; I'm assuming the APNS and FCM servers are encoding their headers properly. Unfortunately, hpack management is the only functionality in kadabra that I didn't write (it uses joedevivo's package).
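
The shape I have in mind is roughly this -- a sketch with placeholder children, not kadabra's actual child specs:

  # Everything tied to one connection -- hpack encoder/decoder, the connection
  # process, and the stream supervisor -- lives under a single :one_for_all
  # supervisor, so a crash in any of them also drops the stale streams.
  # Agents stand in for the real workers; a DynamicSupervisor stands in for
  # StreamSupervisor.
  children = [
    %{id: :encoder, start: {Agent, :start_link, [fn -> :encoder end]}},
    %{id: :decoder, start: {Agent, :start_link, [fn -> :decoder end]}},
    %{id: :connection, start: {Agent, :start_link, [fn -> :connection end]}},
    %{id: :stream_sup, type: :supervisor,
      start: {DynamicSupervisor, :start_link, [[strategy: :one_for_one]]}}
  ]

  {:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_all)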

hpopp commented

This and #30 should most likely be fixed with the changes in #31. I suspect there was a race condition for header decoding on finished streams. Now everything that touches the hpack table has been converted to a :gen_statem.call/2.

This implementation will be slightly slower per request, but with the window update changes, overall throughput should remain roughly the same (or improve slightly).
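
Roughly the difference, using a plain GenServer as a stand-in for the processes that touch the hpack table (illustrative only, not kadabra's code):

  defmodule TableSketch do
    use GenServer

    def start_link(_), do: GenServer.start_link(__MODULE__, [], name: __MODULE__)

    @impl true
    def init(table), do: {:ok, table}

    @impl true
    def handle_call({:decode, header}, _from, table) do
      Process.sleep(10)                    # pretend decoding takes a moment
      {:reply, :ok, [header | table]}
    end

    @impl true
    def handle_cast({:decode, header}, table) do
      Process.sleep(10)
      {:noreply, [header | table]}
    end
  end

  {:ok, _} = TableSketch.start_link([])

  # cast returns immediately, before the decode has run; the caller is free
  # to race ahead (e.g. finish the stream) while the table update is pending.
  GenServer.cast(TableSketch, {:decode, "a"})

  # call returns only after the decode has run, so anything the caller does
  # next is ordered after the table update -- slower per request, but safe.
  GenServer.call(TableSketch, {:decode, "b"})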

Give the PR a try and see if any new errors crop up. The changes should be good for a release.

We'll release it to production today and post back if there are any improvements. We were seeing the hpack errors too, precipitating already_started errors in far larger quantities, so that seems plausible as the root cause.

We haven't noticed this error recurring, but it seems to have been somewhat replaced by uncaught erlang errors of the form:

** (exit) exited in: :sys.terminate(#PID<0.23116.26>, :normal, :infinity) ** (EXIT) normal

It occurs in Kadabra.Connection.maybe_reconnect/1. It probably has something to do with the task-stopping logic there, but I'm not sure; the stack trace is very short:

exit: ** (exit) exited in: :sys.terminate(#PID<0.23116.26>, :normal, :infinity)
    ** (EXIT) normal
  File "lib/task/supervised.ex", line 116, in Task.Supervised.exit/4
  File "proc_lib.erl", line 247, in :proc_lib.init_p_do_apply/3
  Module "Elixir.Kadabra.Connection", in anonymous fn/0 in Kadabra.Connection.maybe_reconnect/1

hpopp commented

maybe_reconnect/1 is going away entirely in the v0.4 release. That error is happening because both :ssl_closed and a received GOAWAY try to shut down the connection with a supervised task, so one of the tasks is getting killed prematurely. The error is harmless, just verbose. I'll see if there's an easy patch.
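
The sort of guard that would quiet it down looks roughly like this (an illustrative sketch, not the actual patch):

  defmodule ShutdownSketch do
    # Tolerate the task already being gone when the two shutdown paths
    # (:ssl_closed and a received GOAWAY) race each other.
    def safe_terminate(pid) do
      :sys.terminate(pid, :normal, :infinity)
    catch
      :exit, _reason -> :ok   # the other path got there first; nothing to do
    end
  end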

If a new release will eliminate it, I'm totally fine waiting for that, but if you put in a patch I'll gladly deploy it to verify on our end.

hpopp commented

Pigeon v1.2.0 and kadabra 0.4.2 are now out, so all of these issues should now be fixed. Will reopen if anything comes up.