codedge-llc/kadabra

MatchError: {:error, {:already_started, #PID<0.18210.10>}}

Closed this issue · 33 comments

We have Kadabra running as our HTTP/2 client for pigeon (as is the default), and we get sporadic bursts of errors of this form. The raw stack trace is:

  Elixir.MatchError: no match of right hand side value: {:error, {:already_started, #PID<0.18210.10>}}
  File "lib/connection/flow_control.ex", line 146, in Kadabra.Connection.FlowControl.process/2
  File "lib/connection.ex", line 227, in Kadabra.Connection.do_send_headers/3
  File "lib/connection.ex", line 109, in Kadabra.Connection.handle_cast/2
  File "gen_server.erl", line 616, in :gen_server.try_dispatch/4
  File "gen_server.erl", line 686, in :gen_server.handle_msg/6
  File "proc_lib.erl", line 247, in :proc_lib.init_p_do_apply/3
  Module "Elixir.Kadabra.Connection", in Kadabra.Connection.init/1

My guess is there's some sort of race condition with connection creation, but I don't know enough about the library to say anything more informed. Any advice?

hpopp commented

Yeah, that would indicate new streams are being created with duplicate stream IDs. What version of kadabra are you running?

From our mixfile

{:pigeon, "~> 1.1.4"}, {:kadabra, "~> 0.3.6"}

hpopp commented

I guess it's worth asking too, what does your pigeon config look like?

  config :pigeon, :apns,
    apns_default: %{
      cert: "/some/creds/path",
      key: "/some/key/path",
      mode: :prod,
      use_2197: true
    }

We currently use pigeon only for APNS, as iOS is the only platform we support.

hpopp commented

Weird, that should be the ideal use case. What about your elixir and OTP versions?

elixir 1.5.2, OTP 20.1.7

hpopp commented

Still confused why it's happening. Haven't had any issues testing with those versions yet. How frequently does this occur and how many pushes are you normally sending at once?

We are a fairly high-volume application, but our iOS footprint is still relatively small. It looks like our move to pigeon actually disrupted our push metrics, so I'll try to pull them up to give you a better idea.

As for the error rate, there's a lot of variance: anywhere from 2 to 800 occurrences a day. There seems to be a compounding effect where you'll get hundreds in a burst, but I don't know the underlying flaw causing it.

As for concurrent push notifications, it's entirely possible we send hundreds simultaneously. We have a lot of fanout in our notifications.

hpopp commented

I just realized your config is wrong, though I'm not sure if it's related to the issue. use_2197: true has been renamed to port: 2197. I need to update the corresponding docs.
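
For reference, your config with the renamed option would be:

  config :pigeon, :apns,
    apns_default: %{
      cert: "/some/creds/path",
      key: "/some/key/path",
      mode: :prod,
      port: 2197
    }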

Oh interesting. Would that have any impact on this bug though?

hpopp commented

Maybe, if it's impacting the kadabra socket configuration. I'll try testing some more. Perhaps in the meantime try removing the port config and see if that resolves it.

The issue recurred after changing the port config.

Another thing I neglected to mention is our runtime. We run on Docker in an Alpine Linux image, using Distillery releases. Not sure if the OS could affect the stability of HTTP/2 in any way.

We are currently seeing the same thing. Just as @michaeljguarino said, we will get many of those errors in bursts.

We start our connections manually in a GenStage consumer:

  def init(:ok) do
    {:ok, apns_conn} = Pigeon.APNS.start_connection(cert: "priv/cert.pem", key: Application.get_env(:my_app, :apns)[:key], mode: :prod)
    {:ok, fcm_conn} = Pigeon.FCM.start_connection(key: Application.get_env(:my_app, :fcm)[:key])
    {:consumer, %{apns: apns_conn, fcm: fcm_conn}}
  end

Also seeing the following error somewhat frequently (not sure if related) --

GenServer {Registry.Kadabra, {#Reference<0.3248228533.436207621.227488>, :decoder}} terminating
** (MatchError) no match of right hand side value: :undefined
    (hpack) /app/deps/hpack/src/hpack.erl:206: :hpack.decode_literal_header_never_indexed/3
    (kadabra) lib/hpack.ex:42: Kadabra.Hpack.handle_call/3
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3

If relevant, we are also running our app via Docker in an Alpine image.

hpopp commented

That second error would indicate something really serious with the connection, as in, the socket is receiving data out of order. I'll have to bring up an alpine image to test further.

Yeah, we see the HPack errors too, but nowhere near the volume of the already_started connection errors.

Have you made any progress on this? Is there anything we can do to help out?

hpopp commented

I still have yet to recreate the bug for myself. If it's a race condition of some kind, I'm hoping it may inadvertently get resolved with some of the structure changes in v0.4.

@msmykowski confirmed the issue persists outside of Docker on Ubuntu.

I'd like to see if I can recreate the environment exactly. What OS are you running and where are you hosted?

We're running on ECS, using one of Amazon's "ECS optimized" AMIs. I'm not 100% positive, but the odds are it's running Amazon Linux.

We're also running entirely within an Amazon VPC, if that helps too.

We have the same problem, running kadabra with pigeon on Alpine, Erlang 20, and Elixir 1.5.0.

hpopp commented

Still haven't seen it crop up in production or any of my testing, but I'm still fairly certain it's being caused by duplicate stream IDs somewhere.

I've set up a PR (#29) that will log out state if it encounters the error. Give it a try and tell me what you get.

{:kadabra, github: "codedge-llc/kadabra", branch: "fix/already-started", override: true}

I'll try to swap the dependency by end of day, and will hand over any errors we see when they come up.

hpopp commented

It's something I've been considering. Push promises shouldn't ever be received in this context as they don't conceptually make sense for how push notifications are sent (same name, but totally different things). I haven't encountered them in the 2.5 years I've been developing pigeon.

The stream isn't incremented because push promises define their own stream identifier. If push promises are getting received on the connection, however, I'm going to make some significant architectural changes in pigeon.

Let me go ahead and update the PR to log out if any promises get received. It's possible APNS or FCM servers have recently changed.

@hpopp I created a branch to work backwards through the errors. In it, I create an arbitrary ref when encoding in the Hpack module.

This generates the first error I was seeing --

17:06:07.325 [error] GenServer {Registry.Kadabra, {#PID<0.574.0>, Kadabra.Connection}} terminating
** (stop) exited in: :gen_statem.call(#PID<0.591.0>, {:send_headers, [{":method", "POST"}, {":path", "/3/device/facce77551eba2d6b2755cbde270bf50abeaddd6896872e49d14d0dbfa57a2c1"}, {"content-length", "123"}, {"apns-topic", "com.bleacherreport.TeamStream"}], "{\"tag\":\"dodgeball\",\"aps\":{\"sound\":\"default\",\"alert\":{\"title\":null,\"body\":\"This is the alert body\"}},\"a\":\"analytics string\"}"}, :infinity)
    ** (EXIT) exited in: GenServer.call({:via, Registry, {Registry.Kadabra, {#Reference<0.0.4.916>, :encoder}}}, {:encode, [{":authority", "api.development.push.apple.com"}, {":method", "POST"}, {":path", "/3/device/facce77551eba2d6b2755cbde270bf50abeaddd6896872e49d14d0dbfa57a2c1"}, {":scheme", "https"}, {"apns-topic", "com.bleacherreport.TeamStream"}, {"content-length", "123"}]}, 5000)
        ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started

Interestingly enough, as the pushes continue, I then see the already_started error --

17:06:07.381 [error] GenServer {Registry.Kadabra, {#PID<0.584.0>, Kadabra.Connection}} terminating
** (MatchError) no match of right hand side value: {:error, {:already_started, #PID<0.598.0>}}
    (kadabra) lib/connection/flow_control.ex:147: Kadabra.Connection.FlowControl.process/2
    (kadabra) lib/connection.ex:227: Kadabra.Connection.do_send_headers/3
    (kadabra) lib/connection.ex:109: Kadabra.Connection.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:667: :gen_server.handle_msg/5
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3

What I seem to be seeing is that nothing ever returns from this call -- https://github.com/codedge-llc/kadabra/blob/master/lib/connection/flow_control.ex#L149 -- so the stream_id is never incremented. This causes the already_started error on the next push.

Incrementing the stream_id before making the call to the stream process mitigates the error, although I do not think that is the root of the problem.
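
For clarity, the mitigation I tried is shaped roughly like this -- an illustrative sketch only, not kadabra's actual FlowControl code, and the helper at the bottom is hypothetical:

  defmodule FlowControlSketch do
    # The point is the ordering: commit the new stream id to state *before*
    # dispatching to the stream process, so a call that never returns (or a
    # crash mid-call) can't leave a stale id to be reused on the next push.
    def process(%{stream_id: stream_id} = state, request) do
      next_state = %{state | stream_id: stream_id + 2}  # client ids increase by 2
      dispatch_to_stream(stream_id, request)
      next_state
    end

    # Hypothetical stand-in for starting the stream process and making the
    # blocking :gen_statem call that was observed to hang.
    defp dispatch_to_stream(stream_id, request) do
      send(self(), {:send_headers, stream_id, request})
      :ok
    end
  end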

One thought I had is that this error --

06:37:59.034 [error] ** State machine {Registry.Kadabra, {#Reference<0.2238806796.3053715460.107751>, 39}} terminating
** Last event = {:cast,
 {:recv,
  %Kadabra.Frame.Headers{end_headers: true, end_stream: false, exclusive: nil,
   header_block_fragment: <<140, 31, 47, 156, 192, 211, 173, 94, 134, 246, 21,
     156, 127, 13, 213, 155, 186, 219, 75, 60, 40, 64, 179, 176, 223, 135, 11,
     111, 57, 125, 225, 95>>, priority: false, stream_dependency: nil,
   stream_id: 39, weight: nil}}}
** When server state  = {:open,
 %Kadabra.Stream{body: "", buffer: "", connection: #PID<0.2125.0>,
  flow: %Kadabra.Stream.FlowControl{max_frame_size: 16384, queue: [],
   stream_id: 39, window: 65405}, headers: [], id: 39,
  ref: #Reference<0.2238806796.3053715460.107751>, scheme: :https,
  settings: nil,
  socket: {:sslsocket, {:gen_tcp, #Port<0.69649>, :tls_connection, :undefined},
   #PID<0.2129.0>}, uri: 'api.development.push.apple.com'}}
** Reason for termination = :exit:{{{:badmatch, :undefined},

may be crashing the hpack process, resulting in the other errors.

Here is the branch if you would like to test for yourself -- https://github.com/msmykowski/kadabra/tree/replicate-errors

@hpopp on further investigation, it looks like the stream supervisor is decoupled from the connection supervisor (one-for-one strategy); however, they share the same ref when started under the top-level supervisor -- https://github.com/codedge-llc/kadabra/blob/master/lib/supervisor.ex#L18.

The connection is coupled with the hpack processes using a one-for-all strategy -- https://github.com/codedge-llc/kadabra/blob/master/lib/connection_supervisor.ex#L29. This means that if the connection crashes for any reason (as we have been seeing with the hpack errors), it is restarted with the same ref, but its stream_id is reset to 1. The stream supervisor, however, is still holding on to the previously existing streams. That causes the already_started error, since a stream with stream_id=1 is already registered under that ref.
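
A minimal, self-contained sketch of that failure mode (module and registry names are illustrative, not kadabra's actual ones):

  defmodule StreamSketch do
    use GenServer

    def start_link({ref, stream_id}) do
      name = {:via, Registry, {StreamRegistry, {ref, stream_id}}}
      GenServer.start_link(__MODULE__, :ok, name: name)
    end

    @impl true
    def init(:ok), do: {:ok, %{}}
  end

  {:ok, _} = Registry.start_link(keys: :unique, name: StreamRegistry)
  ref = make_ref()

  # First connection lifetime: stream 1 is registered under {ref, 1}.
  {:ok, _pid} = StreamSketch.start_link({ref, 1})

  # The connection restarts with the same ref and resets its counter to 1,
  # but the old stream process is still alive and registered.
  StreamSketch.start_link({ref, 1})
  #=> {:error, {:already_started, pid}} -- the value that gets matched
  #   against {:ok, pid} in flow control, hence the MatchError above.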

hpopp commented

Thanks for digging into this -- seems like it would be wise to go ahead and make StreamSupervisor a child of ConnectionSupervisor. I'll put out a patch with the changes.

If either hpack table is ever corrupted, it's best to drop everything related to the connection and restart. Ideally the hpack encoding/decoding should never fail, so we need to figure out why it's happening; I'm assuming the APNS and FCM servers are encoding their headers properly. Unfortunately, hpack management is the only functionality in kadabra that I didn't write (it uses joedevivo's package).
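
The shape I have in mind is roughly this -- a sketch with placeholder children, not kadabra's actual child specs:

  # Everything tied to one connection -- hpack encoder/decoder, the connection
  # process, and the stream supervisor -- lives under a single :one_for_all
  # supervisor, so a crash in any of them also drops the stale streams.
  # Agents stand in for the real workers; a DynamicSupervisor stands in for
  # StreamSupervisor.
  children = [
    %{id: :encoder, start: {Agent, :start_link, [fn -> :encoder end]}},
    %{id: :decoder, start: {Agent, :start_link, [fn -> :decoder end]}},
    %{id: :connection, start: {Agent, :start_link, [fn -> :connection end]}},
    %{id: :stream_sup, type: :supervisor,
      start: {DynamicSupervisor, :start_link, [[strategy: :one_for_one]]}}
  ]

  {:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_all)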

hpopp commented

This and #30 should most likely be fixed with the changes in #31. I suspect there was a race condition for header decoding on finished streams. Now everything that touches the hpack table has been converted to a :gen_statem.call/2.

This implementation will be slightly slower per request, but with the window update changes, overall throughput should remain roughly the same (or improve slightly).
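
Roughly the difference, using a plain GenServer as a stand-in for the processes that touch the hpack table (illustrative only, not kadabra's code):

  defmodule TableSketch do
    use GenServer

    def start_link(_), do: GenServer.start_link(__MODULE__, [], name: __MODULE__)

    @impl true
    def init(table), do: {:ok, table}

    @impl true
    def handle_call({:decode, header}, _from, table) do
      Process.sleep(10)                    # pretend decoding takes a moment
      {:reply, :ok, [header | table]}
    end

    @impl true
    def handle_cast({:decode, header}, table) do
      Process.sleep(10)
      {:noreply, [header | table]}
    end
  end

  {:ok, _} = TableSketch.start_link([])

  # cast returns immediately, before the decode has run; the caller is free
  # to race ahead (e.g. finish the stream) while the table update is pending.
  GenServer.cast(TableSketch, {:decode, "a"})

  # call returns only after the decode has run, so anything the caller does
  # next is ordered after the table update -- slower per request, but safe.
  GenServer.call(TableSketch, {:decode, "b"})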

Give the PR a try and see if any new errors crop up. The changes should be good for a release.

We'll release it to production today and post back if there are any improvements. We were seeing the hpack errors too, precipitating already_started errors in far larger quantities, so that seems plausible as the root cause.

We haven't noticed this error recurring, but it seems to have been somewhat replaced by uncaught erlang errors of the form:

** (exit) exited in: :sys.terminate(#PID<0.23116.26>, :normal, :infinity) ** (EXIT) normal

It occurs in Kadabra.Connection.maybe_reconnect/1. It probably has something to do with the task-stopping logic there, but I'm not sure; the stack trace is very short:

exit: ** (exit) exited in: :sys.terminate(#PID<0.23116.26>, :normal, :infinity)
    ** (EXIT) normal
  File "lib/task/supervised.ex", line 116, in Task.Supervised.exit/4
  File "proc_lib.erl", line 247, in :proc_lib.init_p_do_apply/3
  Module "Elixir.Kadabra.Connection", in anonymous fn/0 in Kadabra.Connection.maybe_reconnect/1

hpopp commented

maybe_reconnect/1 is going away entirely in the v0.4 release. That error is happening because both :ssl_closed and a received GOAWAY try to shut down the connection with a supervised task, so one of the tasks is getting killed prematurely. The error is harmless, just verbose. I'll see if there's an easy patch.
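
The sort of guard that would quiet it down looks roughly like this (an illustrative sketch, not the actual patch):

  defmodule ShutdownSketch do
    # Tolerate the task already being gone when the two shutdown paths
    # (:ssl_closed and a received GOAWAY) race each other.
    def safe_terminate(pid) do
      :sys.terminate(pid, :normal, :infinity)
    catch
      :exit, _reason -> :ok   # the other path got there first; nothing to do
    end
  end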

If a new release will eliminate it, I'm totally fine waiting for that, but if you put in a patch I'll gladly deploy it to verify on our end.

hpopp commented

Pigeon v1.2.0 and kadabra 0.4.2 are now out, so all of these issues should now be fixed. Will reopen if anything comes up.