mtrudel/bandit

Plug.BadRequestError and Plug.TimeoutError

maxguzenski opened this issue · 16 comments

Hi.
I have a Phoenix app running on Docker at AWS (behind an AWS load balancer) receiving 30k req/min, and when I moved from Cowboy to Bandit it started to report a lot (really, a lot) of BadRequestError/TimeoutError exceptions.

With Cowboy this error never happens; with Bandit it happens hundreds of times per minute.

{
  "level": "error",
  "message": "GenServer #PID<0.40962487.2> terminating\n** (Plug.BadRequestError) could not process the request due to client error\n (plug 1.15.3) lib/plug/parsers/json.ex:114: Plug.Parsers.JSON.decode/3\n (plug 1.15.3) lib/plug/parsers.ex:340: Plug.Parsers.reduce/8\n (d4 0.2.0) lib/d4_web/endpoint.ex:1: D4Web.Endpoint.plug_builder_call/2\n (d4 0.2.0) lib/d4_web/endpoint.ex:1: D4Web.Endpoint.\"call (overridable 3)\"/2\n (d4 0.2.0) lib/d4_web/endpoint.ex:1: D4Web.Endpoint.call/2\n (bandit 1.2.3) lib/bandit/pipeline.ex:103: Bandit.Pipeline.call_plug/2\n (bandit 1.2.3) lib/bandit/pipeline.ex:24: Bandit.Pipeline.run/6\n (bandit 1.2.3) lib/bandit/http1/handler.ex:33: Bandit.HTTP1.Handler.handle_data/3\nLast message: {:tcp, #Port<0.473366>, \"POST /api/graphiql HTTP/1.1\\r\\nX-Forwarded-For: 200.173.208.137\\r\\nX-Forwarded-Proto: https\\r\\nX-Forwarded-Port: 443\\r\\nHost: d4swing.com\\r\\nX-Amzn-Trace-Id: Root=1-65df5317-05b5fb5b6ce808a1637defec\\r\\nContent-Length: 1315\\r\\nsec-ch-ua: \\\"Chromium\\\";v=\\\"122\\\", \\\"Not(A:Brand\\\";v=\\\"24\\\", \\\"Google Chrome\\\";v=\\\"122\\\"\\r\\naccept: */*\\r\\ncontent-type: application/json\\r\\nsec-ch-ua-mobile: ?1\\r\\nuser-agent: Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Mobile Safari/537.36\\r\\nsec-ch-ua-platform: \\\"Android\\\"\\r\\norigin: https://d4swing.com\\r\\nsec-fetch-site: same-origin\\r\\nsec-fetch-mode: cors\\r\\nsec-fetch-dest: empty\\r\\nreferer: https://d4swing.com/home\\r\\naccept-encoding: gzip, deflate, br, zstd\\r\\naccept-language: pt-BR,pt;q=0.9\\r\\ncookie: _gcl_au=1.1.1799397319.1709121129; _gid=GA1.2.2108462410.1709121129; _d4_key=SFMyNTY.g3QAAAACbQAAAAdhdXRoX2lkYgEJbZNtAAAAB3VzZXJfaWRiAE3jSw.eFhebBj5uUq86A_r9Ygl8eT5VzobrJl_3cHsxxji2Rg; _ga_E7WQQT6Q2Z=GS1.1.1709134615.2.0.1709134615.60.0.0; _ga=GA1.1.661085438.1709121128\\r\\n\\r\\n\"}",
  "context": {
    "request_id": "F7gQ4LLMgHqz6Wf-s86B"
  }
}
{
  "level": "error",
  "message": "GenServer #PID<0.91546838.2> terminating\n** (Plug.TimeoutError) timeout while waiting for request data\n (plug 1.15.3) lib/plug/parsers/json.ex:110: Plug.Parsers.JSON.decode/3\n (plug 1.15.3) lib/plug/parsers.ex:340: Plug.Parsers.reduce/8\n (d4 0.2.0) lib/d4_web/endpoint.ex:1: D4Web.Endpoint.plug_builder_call/2\n (d4 0.2.0) lib/d4_web/endpoint.ex:1: D4Web.Endpoint.\"call (overridable 3)\"/2\n (d4 0.2.0) lib/d4_web/endpoint.ex:1: D4Web.Endpoint.call/2\n (bandit 1.2.3) lib/bandit/pipeline.ex:103: Bandit.Pipeline.call_plug/2\n (bandit 1.2.3) lib/bandit/pipeline.ex:24: Bandit.Pipeline.run/6\n (bandit 1.2.3) lib/bandit/http1/handler.ex:33: Bandit.HTTP1.Handler.handle_data/3\nLast message: {:tcp, #Port<0.467682>, \"POST /api/graphiql HTTP/1.1\\r\\nX-Forwarded-For: 187.90.201.8\\r\\nX-Forwarded-Proto: https\\r\\nX-Forwarded-Port: 443\\r\\nHost: d4swing.com\\r\\nX-Amzn-Trace-Id: Root=1-65df52c8-15f01f55552129af03347f25\\r\\nContent-Length: 633\\r\\nsec-ch-ua: \\\"Chromium\\\";v=\\\"106\\\", \\\"Google Chrome\\\";v=\\\"106\\\", \\\"Not;A=Brand\\\";v=\\\"99\\\"\\r\\naccept: */*\\r\\ncontent-type: application/json\\r\\nsec-ch-ua-mobile: ?1\\r\\nuser-agent: Mozilla/5.0 (Linux; Android 13; SM-A325M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Mobile Safari/537.36\\r\\nsec-ch-ua-platform: \\\"Android\\\"\\r\\norigin: https://d4swing.com\\r\\nsec-fetch-site: same-origin\\r\\nsec-fetch-mode: cors\\r\\nsec-fetch-dest: empty\\r\\nreferer: https://d4swing.com/likes/activities/167657060\\r\\naccept-encoding: gzip, deflate, br\\r\\naccept-language: pt-BR,pt;q=0.9,en-US;q=0.8,en;q=0.7\\r\\ncookie: _gcl_au=1.1.1600902996.1701628329; _d4_key=SFMyNTY.g3QAAAACbQAAAAdhdXRoX2lkYgCPlp1tAAAAB3VzZXJfaWRiACuXiw.iXBkoo_hf-LtCgNEr3YNANZnenXlIl_M24HvXskjlhU; _gid=GA1.2.1357400890.1709095882; _ga=GA1.2.1479567157.1693768993; _ga_E7WQQT6Q2Z=GS1.1.1709134211.215.1.1709134515.60.0.0; _gat_gtag_UA_22521872_1=1\\r\\n\\r\\n\"}",
  "context": {
    "request_id": "F7gQzh-tTtzmCDEIeTUK"
  }
}
config :d4, D4Web.Endpoint,
  adapter: Bandit.PhoenixAdapter,
  http: [ip: {0, 0, 0, 0}, port: String.to_integer(System.get_env("PORT") || "4000")],
  url: [host: hostname, scheme: "https", port: 443],
  secret_key_base: "<secret>",
  render_errors: [formats: [html: D4Web.ErrorHTML, json: D4Web.ErrorJSON], layout: false],
  pubsub_server: D4.PubSub,
  live_view: [signing_salt: "<secret>"],
  debug_errors: false,
  code_reloader: false,
  check_origin: false,
  force_ssl: false,
  server: true,
  root: ".",
  version: Mix.Project.config()[:version]

Any idea why this happens?

Sorry for the late response; I was on holiday last week.

I'm 99% certain that the timeouts are a secondary symptom of whatever the underlying issue is that's causing the BadRequestError. Are there no other log entries describing what those errors may be? Figuring those out is the next step here.

Is it possible for you to run GitHub branch refs for your mix dependencies? (I.e., if I cut a Plug fork that dumps info about the ignored error tuple at https://github.com/elixir-plug/plug/blob/v1.15.3/lib/plug/parsers/json.ex#L113, could you deploy it and report back?)
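
For reference, that's just a one-line change in your mix.exs deps. A minimal sketch (the fork and branch names here are placeholders; `override: true` is only needed if Plug is a transitive dependency rather than one you list directly):

```elixir
# mix.exs -- placeholder fork/branch names, for illustration only
defp deps do
  [
    {:phoenix, "~> 1.7"},
    {:bandit, "~> 1.2"},
    # Point Plug at a GitHub branch instead of the Hex release; override: true
    # makes this version win over the plug pulled in transitively by phoenix.
    {:plug, github: "some_user/plug", branch: "some_branch", override: true}
  ]
end
```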

What type of load balancer is it, an AWS Network Load Balancer or an AWS Application Load Balancer?

It is an AWS Application LB, handling HTTP requests and WebSockets (the socket side is working with no issues, I think).

I agree with you that the TimeoutError and BadRequestError have the same cause.

Yes, I can deploy and test a custom plug dep.

Looking at the plug_cowboy documentation, I noticed that it has two telemetry events that I was not logging: "[:cowboy, :request, :exception]" and "[:cowboy, :request, :early_error]".

I'm going to log these events; maybe the problem also exists in Cowboy and it just wasn't being logged. I'll let you know what happens soon.
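
Roughly what I plan to attach (untested sketch; the event names are from the plug_cowboy docs, the module name is just an example):

```elixir
# Sketch of a handler to log the two Cowboy telemetry events mentioned above.
defmodule D4Web.CowboyErrorLogger do
  require Logger

  def attach do
    :telemetry.attach_many(
      "cowboy-error-logger",
      [
        [:cowboy, :request, :exception],
        [:cowboy, :request, :early_error]
      ],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(event, _measurements, metadata, _config) do
    Logger.error("cowboy telemetry #{inspect(event)}: #{inspect(metadata)}")
  end
end
```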

By the way:
client: React/GraphQL (apollo-client)
server: Absinthe (Elixir GraphQL)

Yes, I can deploy and test a custom plug dep.

Great. I'll get something cooked up tomorrow (EST) and let you know.

I'm going to log these events; maybe the problem also exists in Cowboy and it just wasn't being logged. I'll let you know what happens soon.

It wouldn't be the first time. Cowboy generally fails silently in almost all cases, whereas Bandit quite deliberately makes lots of noise on errors. This has been the cause of innumerable issues, where people report things that they see on Bandit, which have actually been silently happening for ages without them knowing.

I'm not convinced that's happening here, mostly because this error is coming out of the middle of a plug call; if this were erroring on Cowboy, I suspect you'd have been seeing errors then as well. Definitely worth a check though; let me know if you see any errors now that you've attached to those Cowboy telemetry hooks.

No issues reported with "[:cowboy, :request, :exception]" or "[:cowboy, :request, :early_error]".

I'll wait for your version of Plug and test Bandit again.

Give {:plug, github: "mtrudel/plug", branch: "log_decode_error"} a shot and let's see what error comes back.

Added simple logging to the exception message: https://github.com/mtrudel/plug/tree/log_decode_error

GenServer #PID<0.1336731.0> terminating

** (Plug.BadRequestError) Got error :closed
    (plug 1.15.3) lib/plug/parsers/json.ex:114: Plug.Parsers.JSON.decode/3
    (plug 1.15.3) lib/plug/parsers.ex:340: Plug.Parsers.reduce/8
    (d4 0.2.0) lib/d4_web/endpoint.ex:1: D4Web.Endpoint.plug_builder_call/2
    (d4 0.2.0) lib/d4_web/endpoint.ex:1: D4Web.Endpoint."call (overridable 3)"/2
    (d4 0.2.0) lib/d4_web/endpoint.ex:1: D4Web.Endpoint.call/2
    (bandit 1.2.3) lib/bandit/pipeline.ex:103: Bandit.Pipeline.call_plug/2
    (bandit 1.2.3) lib/bandit/pipeline.ex:24: Bandit.Pipeline.run/6
    (bandit 1.2.3) lib/bandit/http1/handler.ex:33: Bandit.HTTP1.Handler.handle_data/3
[Screenshot 2024-03-05 02:59:32: error log output]

Huh. That doesn't seem to suggest an obvious cause.

If you keep this branch around while running Cowboy, do you see any similar errors?

OK, I rolled back to Cowboy; I'll test it with your Plug branch after midnight (GMT-3).

I've been fighting a bug for months and I can't find the cause. When I change from Cowboy to Bandit the problem gets worse.

My application is a social network communicating with AWS Aurora. Everything works very fast with up to 5 thousand users online (less than 0.1 ms on average per request), and I use a pool of 200 connections to the database. When I reach 5.2 thousand users online, all queries/inserts start taking more than 3 seconds (which is my timeout), all Ecto connections become blocked, and all requests return an error. Restarting the application (Docker) solves the problem and everything responds again in less than 0.1 seconds.

This all happens from one moment to the next; there is no gradual ramp-up. Memory is at 45% and CPU at less than 35% (both in the application and in Aurora).
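
For reference, the Repo side is configured roughly like this (an illustrative sketch: only pool_size and timeout are the real numbers from above; the hostname is a placeholder and the queue settings shown are just the DBConnection defaults):

```elixir
# Approximate D4.Repo config (secrets and hostname are placeholders).
config :d4, D4.Repo,
  hostname: "aurora-endpoint.example.com",
  pool_size: 200,       # the 200-connection pool mentioned above
  timeout: 3_000,       # the 3-second query timeout everything starts hitting
  queue_target: 50,     # DBConnection checkout queue tuning (defaults)
  queue_interval: 1_000
```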

In the image below I show the application dropping to zero requests every 5 minutes with Bandit; in the same graph you can see that it remains stable with Cowboy (this also happens with Cowboy, but only once or twice per day).

The left half is Bandit, the right half is Cowboy. Times are in GMT-3.

[Screenshot 2024-03-05 13:38:49: request rate graph, Bandit (left) vs Cowboy (right)]

The problem is bad enough that Logger stops reporting error records (it starts dropping them). You can see the stretches with no records in the graph below.

[Screenshot 2024-03-05 13:38:38: error log volume graph with gaps where records were dropped]

Although the application blames the database for taking too long, the database blames the application for not reading the responses; at least, that's how I understand this "Client:ClientRead" wait event. Since the problem gets worse or better by changing libraries on the Elixir side, I believe the database is correct. P.S.: Times here are GMT+0.

[Screenshot 2024-03-05 13:18:24: Aurora wait events dominated by Client:ClientRead]

https://docs.aws.amazon.com/pt_br/AmazonRDS/latest/AuroraUserGuide/apg-waits.clientread.html

Here's an overview; notice that the database's commit latency increases a lot, but, as shown in the previous graph, it blames the application for not reading the response.

[Screenshot 2024-03-05 13:34:10: Aurora overview showing increased commit latency]

This error happens with both Cowboy and Bandit, but it happens much more often with Bandit, making it impossible to use Bandit in production.

I know that Bandit and the database shouldn't have any relationship, but something happens and they do. I believe it is related to the network.

What version of Thousand Island are you running? The Nerves folks saw something kinda-sorta-a-little-bit like this that we fixed in 1.3.3.

"thousand_island": {:hex, :thousand_island, "1.3.5", "6022b6338f1635b3d32406ff98d68b843ba73b3aa95cfc27154223244f3a6ca5", [:mix], [{:telemetry, "~> 0.4 or ~> 1.0", [hex: :telemetry, repo: "hexpm", optional: false]}], "hexpm", "2be6954916fdfe4756af3239fb6b6d75d0b8063b5df03ba76fd8a4c87849e180"},

Are you potentially missing the tracking of a returned conn in one of your plugs? I wonder if this is a similar root cause to #260 and #294 (@alisinabh; do you have thoughts here?)

Note that the offending plug wouldn't have to be the plug handling the request that crashed, since the nature of this bug is that mis-tracking a conn ends up poisoning future requests on the same connection.
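
To make the failure mode concrete, the class of bug I mean looks roughly like this (module name made up; the point is that the conn returned by a Plug.Conn function gets dropped):

```elixir
defmodule MyApp.SomePlug do
  import Plug.Conn

  def init(opts), do: opts

  # Broken: the conn returned by read_body/1 is discarded, so the adapter's
  # view of how much of the body has been read no longer matches the socket,
  # and a later request on the same connection can end up poisoned.
  def call(conn, _opts) do
    {:ok, _body, _updated_conn} = read_body(conn)
    conn
  end

  # Fixed version: keep threading the conn that each Plug.Conn call returns.
  # def call(conn, _opts) do
  #   {:ok, _body, conn} = read_body(conn)
  #   conn
  # end
end
```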

If you keep this branch around while running Cowboy, do you see any similar errors?

No. No unknown errors with Cowboy.

I think I ended up putting two different issues in the same topic. Sorry.

  1. Bandit raises BadRequestError many times, and this error does not occur with Cowboy (at least it is not reported). The app handles 30-50k req/min; BadRequestError occurs fewer than 40 times per minute from what I have seen.

  2. With more than 5 thousand users online (close to 50k req/min) the app freezes and no longer responds to ANY requests (LiveDashboard freezes, everything freezes, with NO CPU increase). The problem happens with both Cowboy and Bandit, but with Bandit it happens many times in a short period, while with Cowboy it can run for many hours before happening.

We are trying to fix item 1 for now.

Are you potentially missing the tracking of a returned conn in one of your plugs? I wonder if this is a similar root cause to #260 and #294 (@alisinabh; do you have thoughts here?)

Note that the offending plug wouldn't have to be the plug handling the request that crashed, since the nature of this bug is that mis-tracking a conn ends up poisoning future requests on the same connection.

I don't think so; I checked all the plugs and they are all correct. I also have more than 200 local tests and cannot reproduce this error locally; it only occurs in production.

My plugs, in order of use:
https://gist.github.com/maxguzenski/d17f6b28fc80a92465583db6d08cbc94

I have to admit, I'm kinda stumped by this. Just to keep the investigation moving:

  • Given the scale of your site, it's not out of the realm of possibility that 0.1% of your requests really are client closures. Keeping in mind the fact that Cowboy is generally pretty quiet about errors, are you absolutely certain that you're not seeing such closures in Cowboy?

  • In contrast with Cowboy, when the client closes the connection Bandit doesn't eagerly kill the handler process; any resources such handlers are using will not be closed until you terminate the plug call yourself. If your app has lots of abandoned connections, it's possible you may be holding on to resources longer with Bandit than with Cowboy. See #202 for more info.

  • Do you have any metrics that describe the state of your node as this happens? Specifically, as I mention above, it's possible you're exhausting sockets / blowing up your run queue due to something in your app holding a bunch of upstream resources.
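
On that last point, even something as simple as the following, run from a remote shell (or logged periodically) when the freeze happens, would tell us a lot. This is an untested sketch using standard :erlang introspection calls; the module name is made up:

```elixir
# Quick node-state snapshot: scheduler run queue, open ports (sockets), processes.
defmodule D4.NodeStats do
  require Logger

  def log do
    Logger.info(
      "run_queue=#{:erlang.statistics(:run_queue)} " <>
        "ports=#{:erlang.system_info(:port_count)}/#{:erlang.system_info(:port_limit)} " <>
        "processes=#{:erlang.system_info(:process_count)}/#{:erlang.system_info(:process_limit)}"
    )
  end
end
```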

Yes, it is less than 0.1%. And I think it is a poorly formatted request sent by the client. But do you think raising an error is the best way to deal with this? Wouldn't it be better to report it via telemetry instead?
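
For example, something like the sketch below is what I had in mind: count these as a metric instead of crashing the handler and logging a stack trace. I'm assuming a [:bandit, :request, :exception] event name here, which would need to be checked against the Bandit.Telemetry docs for the installed version:

```elixir
# Idea only: surface client errors via telemetry/metrics rather than crash logs.
:telemetry.attach(
  "bandit-request-errors",
  [:bandit, :request, :exception],
  fn _event, _measurements, metadata, _config ->
    :telemetry.execute([:d4, :client_errors], %{count: 1}, metadata)
  end,
  nil
)
```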

I'm investigating paid services to ship all my telemetry, Docker, network, and database metrics to. To start, I'm looking at https://www.datadoghq.com/

Do you have any other telemetry service to recommend?

I believe the underlying issue here has been fixed with 1.5.0. I'm going to close this for hygiene, but if it reoccurs please feel free to reopen!