mtrudel/bandit

Headers being read from uncleared buffer

Closed this issue · 4 comments

Phoenix: 1.7.7
Bandit: 1.0.0-pre.10

We recently tried to use Bandit in our Phoenix application. However, we ran into a few issues. One of them seems to be that the body of the previous request (that might've crashed) is still stuck in the buffer of the current request, which causes reading of the headers to fail. Example log message:

GenServer #PID<0.61188.0> terminating
** (stop) "header read error: \"[\\\"35\\\",\\\"35\\\",\\\"conversations:XXX\\\",\\\"phx_join\\\",{\\\"user_id\\\":\\\"XXX\\\",\\\"reconnect\\\":true}]GET /socket/longpoll?frontend=web_widget&vsn=2.0.0&token=XXX HTTP/1.1\\r\\n\""
Last message: {:tcp, #Port<0.15766>, "[\"35\",\"35\",\"conversations:XXX\",\"phx_join\",{\"user_id\":\"XXX\",\"reconnect\":true}]"}

It looks like the LongPoll body from the previous request is still in the buffer while trying to parse the current GET, causing a crash. This log message seems to typically be preceded by this one:

GenServer #PID<0.64155.0> terminating
** (FunctionClauseError) no function clause matching in Bandit.HTTP1.Handler.handle_info/2
    (bandit 1.0.0-pre.10) /app/deps/thousand_island/lib/thousand_island/handler.ex:5: Bandit.HTTP1.Handler.handle_info({:DOWN, #Reference<0.29137531.1567883266.137509>, :process, #PID<96119.61130.0>, :normal}, {%ThousandIsland.Socket{socket: #Port<0.16007>, transport_module: ThousandIsland.Transports.TCP, read_timeout: 60000, span: %ThousandIsland.Telemetry{span_name: :connection, telemetry_span_context: #Reference<0.29137531.1567883266.137211>, start_time: -576455993151966621, start_metadata: %{remote_address: {1, 2, 3, 4}, remote_port: 53961, telemetry_span_context: #Reference<0.29137531.1567883266.137211>, parent_telemetry_span_context: #Reference<0.29137531.1563951106.108661>}}}, %{opts: %{http_1: [], http_2: [], websocket: []}, plug: {AppWeb.Endpoint, []}, handler_module: Bandit.HTTP1.Handler, http_1_enabled: true, http_2_enabled: true, websocket_enabled: true, requests_processed: 1}})
    (stdlib 5.0.2) gen_server.erl:1077: :gen_server.try_handle_info/3
    (stdlib 5.0.2) gen_server.erl:1165: :gen_server.handle_msg/6
    (stdlib 5.0.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
Last message: {:DOWN, #Reference<0.29137531.1567883266.137509>, :process, #PID<96119.61130.0>, :normal}

We haven't yet been able to track down which process is causing the DOWN message.

We've also seen a few of these:

WebSocket upgrade failed: error in upgrade_header check: \"Did not find 'websocket' in ''\"

The requests that trigger the above error seem to come from AppleWebKit, with Connection: Keep-Alive, and no Upgrade header.

Another thing of note that might help is that we're seeing a big increase in socket connections using the LongPoll transport. This might, however, be our own issue.

We're not using any custom config or custom plugs before the socket plug in our endpoints; the config is simply adapter: Bandit.PhoenixAdapter.
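For reference, the relevant endpoint config looks roughly like this (the :app OTP app name is a placeholder; AppWeb.Endpoint is the endpoint from the logs above):

    # In config/config.exs (after `import Config`); :app is a placeholder name.
    config :app, AppWeb.Endpoint,
      adapter: Bandit.PhoenixAdapter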

It looks like you've got two (maybe three) separate issues here.

The first is about stale request bodies still being in the buffer when attempting to read a second request. The issue here is that body reads are lazy in Plug (Bandit doesn't read the body unless you tell it to). If your plug implementation doesn't read the body, we can end up in the situation you're seeing, where the keep-alive handling assumes the body has been read when it actually hasn't.
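To illustrate the lazy-read behaviour, here's a hypothetical plug (not anyone's actual code) that responds without ever calling Plug.Conn.read_body/2; the posted bytes stay on the socket and, on a keep-alive connection, end up in front of the next request's headers:

    # Hypothetical plug: responds without reading the request body.
    # Because Plug.Conn.read_body/2 is never called, any request body stays
    # buffered on the connection and precedes the next keep-alive request.
    defmodule SkipBodyPlug do
      import Plug.Conn

      def init(opts), do: opts

      def call(conn, _opts), do: send_resp(conn, 204, "")
    end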

The fix for this is pretty easy: at the conclusion of an HTTP/1 request Bandit can just look to see if a body is indicated based on the headers (we already track this for other reasons), then read it (and immediately throw it on the floor) before reading the next keepalive request.
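The drain step would look roughly like the following. This is a sketch of the idea in terms of the public Plug.Conn.read_body/2 API, not Bandit's actual internals:

    # Sketch only: read and discard any unread request body so a keep-alive
    # connection starts the next request at a clean boundary.
    defmodule DrainBody do
      def drain(conn) do
        case Plug.Conn.read_body(conn, length: 64_000) do
          {:ok, _discarded, conn} -> conn
          {:more, _discarded, conn} -> drain(conn)
          {:error, _reason} -> conn
        end
      end
    end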

The second is about bogus websocket upgrades. This is almost certainly because you have an HTTP request (not an upgrade request) hitting your Phoenix app's socket endpoint. Phoenix doesn't validate that upgrade requests are valid; it just assumes that anything hitting the websocket route within a socket endpoint is a valid upgrade, and fires it off to Bandit/Cowboy for upgrading. If non-upgrade HTTP requests manage to hit that URL you'll get exactly the error you see; I suspect that's what's happening here. The fix for this is to eagerly check the request at the time the upgrade is enqueued; this is currently pending in phoenixframework/websock_adapter#14.
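The eager check amounts to something like the following. This is only an illustration of the idea, not the actual websock_adapter change; the module and function names are made up:

    # Illustration only: validate the Upgrade header before asking the server
    # to upgrade, so a plain HTTP request gets a 400 instead of a late crash.
    defmodule EagerUpgradeCheck do
      import Plug.Conn

      def maybe_upgrade(conn, handler, state, opts) do
        websocket? =
          conn
          |> get_req_header("upgrade")
          |> Enum.any?(&(String.downcase(&1) == "websocket"))

        if websocket? do
          WebSockAdapter.upgrade(conn, handler, state, opts)
        else
          conn |> send_resp(400, "Not a WebSocket upgrade request") |> halt()
        end
      end
    end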

The third issue is probably related to the second: if clients are having a hard time upgrading connections to websockets, they're now much more eager to fall back to long polling. The Phoenix folks changed a bunch of long poll fallback behaviour recently (Chris talked about it in his keynote at ElixirConf a few weeks back) and you may be seeing the impact of this.

Takeaways:

  • I'll file a separate issue (for cleanliness) to improve Bandit's handling of unread bodies when turning keep alive loops
  • I'll investigate if there's something funky with websocket upgrades and long polling in new Phoenix versions, with a particular focus on Apple browsers

Thanks!

In the meantime, I'll close this issue for hygiene since there are (or shortly will be) issues tracking more specific aspects of it.

Thanks again!

Thanks for the response. About the bogus upgrades, a simple

curl http://localhost:4000/socket/websocket

returns HTTP 400 but also terminates the handler process with a red error in the console:

[error] GenServer #PID<0.14354.0> terminating
** (stop) "WebSocket upgrade failed: error in http_version check: :\"HTTP/1.0\""

Not sure if this is by design, but my feeling is this could be handled more gracefully, keeping the handler process alive? (e.g. an OPTIONS request to the same endpoint does not crash the handler GenServer)

That is (unfortunately) by design. The way upgrades work is as follows:

  • Phoenix (or your app, if you're using raw Plugs) signals a desire to upgrade the connection via WebSockAdapter.upgrade/3 (a minimal sketch of this flow follows this list).
  • This only signals to the underlying server that an upgrade should take place. The upgrade process doesn't start until after the Plug.call/2 call has completed and control has passed back into the underlying server (this is true both in Bandit and Cowboy). By this point, if the server encounters an error with the upgrade (such as your bogus curl request above), there's no way to signal this to the developer other than logging.
  • The fix here is to eagerly validate that the upgrade indicates a valid request at the time the application makes the upgrade request. This work is being done in phoenixframework/websock_adapter#14 and is nearly ready to go.
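For completeness, a minimal raw-Plug sketch of that flow (the EchoHandler module, the route, and the options are all hypothetical): the upgrade call only marks the conn, and the actual upgrade happens after call/2 returns control to Bandit, which is why a bad request only surfaces as a log entry later.

    # Hypothetical raw Plug: WebSockAdapter.upgrade only flags the conn; the
    # server (Bandit or Cowboy) performs the actual upgrade after call/2 returns.
    defmodule UpgradePlug do
      def init(opts), do: opts

      def call(%Plug.Conn{path_info: ["socket", "websocket"]} = conn, _opts) do
        WebSockAdapter.upgrade(conn, EchoHandler, %{}, timeout: 60_000)
      end

      def call(conn, _opts) do
        Plug.Conn.send_resp(conn, 404, "Not found")
      end
    end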