mtrudel/bandit

Request Line Read Errors

andyleclair opened this issue · 28 comments

Hi!
First off, thank you for Bandit, I really like using it for my personal projects!

However, we at Appcues tried to drop Bandit in as a replacement for Cowboy in our main (Phoenix 1.7) production app, but we immediately noticed a flood of crashes that we never saw under Cowboy.

All of the failures look like this:

GenServer #PID<0.25584.4> terminating
** (stop) "request line read error: \"04.197.85.152\\r\\n\""
Last message: {:tcp, #Port<0.4180>, "04.197.85.152\r\nX-Forwarded-Proto: https\r\nX-Forwarded-Port: 443\r\nHost: api.appcues.com\r\nX-Amzn-Trace-Id: Root=1-65bbbdc5-59ea15af3cb64254080c8a58\r\nContent-Length: 0\r\nAuthorization: Basic REDACTED\r\nAccept: application/json, text/json, text/x-json, text/javascript, application/xml, text/xml\r\nUser-Agent: RestSharp/108.0.1.0\r\ntraceparent: 00-29a1491680b1d6a4e28bea9336cfaf05-d177b554a1ca7845-00\r\nAccept-Encoding: gzip\r\nContent-Type: application/ndjson\r\n\r\n"}

Happy to help troubleshoot and investigate! I've got hundreds of Sentries with errors like these if more context would be helpful. It's also entirely possible that these requests are well and truly bogus and Cowboy was just ignoring them, but I'm not sure. It appears to be some sort of encoding issue?

FWIW, we're using Elixir 1.14.1 and OTP 25.

It's possible that you were seeing these all the time with Cowboy, since it generally does everything silently (this has been the source of innumerable issues here, with people now seeing errors for things that were previously silently ignored).

But that doesn't look like what's happening here. What you're showing looks an awful lot like Bandit attempting to start parsing an otherwise valid HTTP/1 request halfway through (specifically, the text there is obviously headers, but Bandit is expecting to see the GET /path HTTP/1.1 line that starts off all HTTP requests). There could be a number of reasons for this. Some questions:

  1. What version of Bandit & Thousand Island is this (from your mix.lock file)?
  2. Can you provide the context of the logs for a few seconds before this?
  3. If you merge the following bit of config into your Phoenix config:
http(s): [
  http1_options: [max_requests: 1],
]

does the problem go away?
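
For reference, here's a minimal sketch of what that might look like merged into a Phoenix 1.7 endpoint config using Bandit.PhoenixAdapter (the app name, endpoint module, and port below are placeholders, and the http1_options key mirrors the suggestion above):

# config/prod.exs -- hypothetical app and endpoint names
config :my_app, MyAppWeb.Endpoint,
  adapter: Bandit.PhoenixAdapter,
  http: [
    port: 4000,
    # limit each HTTP/1 connection to a single request, disabling keep-alive reuse
    http1_options: [max_requests: 1]
  ]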

The ndjson content type suggests that this is a long poll request, so it's likely that this is a keep-alive request (i.e. it's not the first request the client has sent on this socket connection). On the other hand, the RestSharp user agent makes me think... maybe it's not?

I have a hunch I know what it might be; I've been wondering about this for a while and I think you may have just given me a clue. Do you by any chance know what this is calling in to in Phoenix, and what type of response it is? (standard HTTP controller, channel long polling, etc?)

Moreover, since it looks like you may have control over the client here given its user-agent, do you know:

  • what requests it's making (specifically, the method)
  • whether you're using keepalive
  • if you're sending request bodies, what transfer method you're using (content-length encoding, chunked encoding, etc)

Sorry for the slow response, I was on baby duty 😅. Some of these requests we do control, some we don't. Some of them come from another Phoenix app we have, and it talks to the app the error was coming from over HTTP/2 via Finch. We do GET and POST; it looks like mostly POST, though (I'm guessing on the GETs). I don't think we specify the transfer method, so it's whatever the default is in Finch, and therefore Mint? These are Phoenix controller endpoints, FWIW.

The endpoints we saw this behavior on were prod endpoints, so I'm sorta wary about rolling it back out. If you have a scenario you'd like me to test, I'm happy to try doing it locally or on one of our staging instances?

Here's an interesting tidbit I saw in the logs; I'm not sure if this is helpful:
[screenshot: log excerpt showing calls to chunked_encoding functions]

Not sure, but I have a feeling that this is similar to what we experienced with the Stripe library mishandling conn token assignments here: beam-community/stripity-stripe#821

@andyleclair can you quickly check to see if you are using Plug.Conn.read_body? (It might not be a direct usage in your code; it could be something a library is doing.)

Another option to confirm is the trick that @mtrudel mentioned with max_requests: 1, to see if this actually is related to keep-alive connections.

Those logs are super helpful; the 'chunked_encoding' functions in your latest screenshot are key!

What looks to be happening is that the client is POSTing a chunked encoding request body and Bandit is having an issue decoding it properly. If this happens, Bandit can read too much or not enough of the request body, so when the client reuses that connection via keep-alive, Bandit goes looking for the next request line (e.g. GET /path HTTP/1.1) and instead sees either the end of the previous request's POST body or a point mid-way through its headers.
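
To make the desync concrete, here's a rough illustration (made-up path, host, and body; not Bandit's actual parsing code) of two keep-alive requests sharing one socket, and why a framing mistake surfaces as a request line error:

# two requests on one keep-alive connection, shown as raw bytes
socket_bytes =
  "POST /v1/activity HTTP/1.1\r\n" <>
    "Host: api.example.com\r\n" <>
    "Transfer-Encoding: chunked\r\n\r\n" <>
    "5\r\nhello\r\n" <>              # a single 5-byte chunk
    "0\r\n\r\n" <>                   # terminating chunk: the body ends exactly here
    "GET /health HTTP/1.1\r\n" <>    # the next request must start at this byte
    "Host: api.example.com\r\n\r\n"

# If the first body isn't consumed exactly up to the terminating "0\r\n\r\n",
# the next read starts somewhere inside the leftover bytes, and the parser ends
# up trying to treat body or header text as a request line -- which is exactly
# what the "request line read error" above is reporting.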

The next question is whether this is an issue with Bandit or with the client library. The most useful thing would be to determine exactly which client library is making these connections, and to see what debug info we can get out of that end (if it's possible for you to reproduce this on demand, you can also capture those logs for Bandit).

Another option to confirm is the trick that @mtrudel mentioned with max_requests: 1, to see if this actually is related to keep-alive connections.

100% it is

This is a corner of the spec that it's really important to be careful with

@andyleclair can you quickly check to see if you are using Plug.Conn.read_body? (It might not be a direct usage in your code; it could be something a library is doing.)

Knowing the exact version of Bandit / Thousand Island from your mix.lock would be useful. Specifically, some work went out way back in 1.0.0-pre.17 that addressed exactly this.

This was on latest! We were using Bandit 1.1.3 and Thousand Island 1.3.2

This was on latest! We were using Bandit 1.1.3 and Thousand Island 1.3.2

Good to know! Next step is knowing the client library to see if it's us or them.

Those logs are super helpful; the 'chunked_encoding' functions in your latest screenshot are key!

What looks to be happening is that the client is POSTing a chunked encoding request body and Bandit is having an issue decoding it properly. If this happens, Bandit can read too much or not enough of the request body, so when the client reuses that connection via keep-alive, Bandit goes looking for the next request line (e.g. GET /path HTTP/1.1) and instead sees either the end of the previous request's POST body or a point mid-way through its headers.

The next question is whether this is an issue with Bandit or with the client library. The most useful thing would be to determine exactly which client library is making these connections, and to see what debug info we can get out of that end (if it's possible for you to reproduce this on demand, you can also capture those logs for Bandit).

I can't be 100% sure of all clients, since this is an external-facing API endpoint; however, I can be sure that some of the requests are internally generated, and those requests use Finch.

I'm actually willing to bet that all the requests were Finch. We don't have many clients that directly integrate with our older API; they mostly integrate with our v2 API, which (as you might guess) talks back to the v1 API. Those should all be HTTP/2, since I'm guessing Phoenix will just upgrade the Finch connections, since they both support HTTP/2.

Unless your Finch (or any other) client is connecting over HTTPS, it'll essentially never automatically upgrade to HTTP/2. You'd either need 'prior knowledge' or h2c, neither of which is in widespread use (and specifically, neither of which is supported by Finch). Is it HTTPS between your Finch client and Bandit?

Another clue: it would be whoever is POSTing to /v1/accounts/:account_id/users/:user_id/activity, per the screenshot.

Yeah, that's our v1 endpoint, but I can see from some of the errors in Sentry that they have a client header of mint, which says to me it's coming from our v2 API. It's HTTPS between the client and the Amazon ALB, which terminates SSL, but the connection between the ALB and our servers is bare HTTP.

[screenshot]

Not sure, but I have a feeling that this is similar to what we experienced with the Stripe library mishandling conn token assignments here: beam-community/stripity-stripe#821

That's exactly it I bet @alisinabh!

@andyleclair, if you look at your controller implementations, I'd bet you're not carrying forward the conn as returned from Plug.Conn.read_body or a similar function (basically everything in Plug.Conn returns a new conn that you're expected to use from there on).

I don't know for sure which controller it is; the POST handler for /v1/accounts/:account_id/users/:user_id/activity is a good candidate, but the controller that messed up is the one that handled the previous request from that client. If you're logging 200s, it'd likely be the previous request you saw (or close to it, if you've got lots of clients).

It's like we've been here before, @alisinabh !

Just to make sure I have this right: if I call something like Plug.Conn.fetch_query_params, which returns a Plug.Conn, I need to keep that Plug.Conn in scope for the rest of the lifetime of the outer call?

I mean that does make sense, and I have already found (I think) the place in our code where this happens. I guess if that's the thing I need to go fix, what are the consequences of it being like that now? Is it fine? Am I secretly mangling requests all the time and not knowing about it?

Am I secretly mangling requests all the time and not knowing about it?

Almost certainly, yes. Cowboy depends on this same convention

Just to make sure I have this right: if I call something like Plug.Conn.fetch_query_params, which returns a Plug.Conn, I need to keep that Plug.Conn in scope for the rest of the lifetime of the outer call?

Yep, that's it:

Wrong:

def call(conn, _opts) do
  # the conn returned by read_body/2 is thrown away here
  {:ok, body, _conn} = Plug.Conn.read_body(conn)
  Plug.Conn.send_resp(conn, 200, "OK")
end

Right:

def call(conn, _opts) do
  # keep using the conn returned by read_body/2 from here on
  {:ok, body, conn} = Plug.Conn.read_body(conn)
  Plug.Conn.send_resp(conn, 200, "OK")
end

Remember: the functional paradigm generally avoids side effects, and you're creating side effects by reading and writing bytes to/from a conn structure. Those side effects are captured in the conn struct that gets returned to you (specifically, in this case, the bytes that we've read from the network previously but haven't yet processed are maintained in a buffer inside that conn, so if you don't keep track of it, subsequent requests will just 'skip over' them).
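
As a concrete illustration of that buffering, here's a minimal sketch (hypothetical module name, no length limits or timeouts) of reading a full request body while threading the conn through every call; Plug.Conn.read_body/2 may return {:more, partial, conn} for larger or chunked bodies, and the conn it hands back is the one you must keep using:

defmodule MyApp.BodyReader do
  # Reads the entire request body, always carrying forward the conn returned
  # by each read_body/2 call so its internal buffer state is preserved.
  def read_all(conn, acc \\ "") do
    case Plug.Conn.read_body(conn) do
      {:ok, body, conn} -> {:ok, acc <> body, conn}
      {:more, partial, conn} -> read_all(conn, acc <> partial)
      {:error, reason} -> {:error, reason}
    end
  end
end

Whatever conn comes back from {:ok, body, conn} is the one to pass on to send_resp and everything else that follows.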

Am I secretly mangling requests all the time and not knowing about it?

Almost certainly, yes. Cowboy depends on this same convention

https://i.imgur.com/Vl4Pzi8.gif

Well, I guess we can close this one, I've uhh gotta go patch some code

Thank you both so much for the help!

I'll leave this issue open until you verify you've fixed things, and also to get my brain thinking about possible ways we could try to detect this on Bandit's side.

Because @alisinabh is a rock star and already started a conversation about how we can improve this over at #295, we don't need to keep this issue open for our purposes any longer.

@andyleclair can you confirm the suggestions above fix your problem so we can close this issue?

I haven't started rolling this fix out yet, so I can't 100% confirm right now, but as soon as you all told me the symptom, I immediately found where our code mishandles the conn. I'm happy to close this out and I can reopen later if need be