mtrudel/bandit

Server appears to hang after some period of uptime

bfolkens opened this issue · 11 comments

Versions

Docker image elixir:1.15.6-otp-25
thousand_island 1.3.2
bandit 1.2.2

Summary

We migrated from cowboy to Bandit over the weekend and have since been tracking a problem where the server appears to hang after some period of uptime. Connections drop and new connections appear to hang, which causes the healthcheck to fail and the Pod to be restarted. Once the restart completes, traffic resumes normally. This is happening on two different websites that have a substantial amount of WebSocket traffic (in addition to high HTTP traffic).

We're using the plain HTTP (non-SSL) transport with the HTTP/1 and WebSocket protocols.

We may end up reverting the migration back to cowboy for the remainder of the week to test if something in Bandit is indeed the cause.

Detail

Below is a telemetry chart of the VM's port count shortly after we migrated to Bandit. Each unique color is a Pod.

[Screenshot 2024-02-20 at 9:53:00 AM]

Zoomed in to the most recent several hours:

[Screenshot 2024-02-20 at 9:51:15 AM]

Finally, the last two incidents are shown below alongside the run_queue length. Note that the active port count drops first, followed by a spike in run_queue:

[Screenshot 2024-02-20 at 10:04:40 AM]

[Screenshot 2024-02-20 at 10:05:14 AM]

We'll also attempt to monitor the disconnections to see if there is some correlation.
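
For reference, the two metrics in the charts above can be sampled directly from the BEAM. A minimal sketch (the module and event names are hypothetical; you'd wire this into :telemetry_poller or whatever reporter you already use):

defmodule MyApp.VMMetrics do
  # Emits the measurements shown in the charts above:
  # the VM's open port count and the total scheduler run queue length.
  def report do
    :telemetry.execute([:my_app, :vm], %{
      port_count: :erlang.system_info(:port_count),
      run_queue: :erlang.statistics(:run_queue)
    })
  end
end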

This definitely rhymes with the issue the Nerves folks are seeing (we're working on it via IM, so there's no issue filed for it at the moment). I've got a bunch of leads, but nothing concrete yet.

Any chance you're doing any possibly blocking work in your channels' terminate functions? Things like DB writes or the like?
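
(To be clear about what I mean: something like the hypothetical override below, where a synchronous DB write in terminate/2 can hold the channel process open during shutdown. The names here are made up purely for illustration.)

defmodule MyAppWeb.RoomChannel do
  use Phoenix.Channel

  def join("room:" <> _id, _params, socket), do: {:ok, socket}

  # The kind of thing to avoid: a blocking call on every disconnect.
  # MyApp.Disconnects.record/1 stands in for a synchronous DB write.
  def terminate(_reason, socket) do
    MyApp.Disconnects.record(socket.assigns.user_id)
    :ok
  end
end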

I'm going to keep working on it with them, mostly because they've got a working repro environment. I'll make sure to loop back to this ticket once we make any progress.

Fantastic, thanks @mtrudel !

Any chance you're doing any possibly blocking work in your channels' terminate functions? Things like DB writes or the like?

No, no overrides, just the default behavior.

What load balancer is in front of this? AWS 'network' mode?

We're using a network endpoint group (NEG) in a load balancer through Kubernetes (GKE, Ingress) on Google Cloud.

Hey @bfolkens,

I have a hunch about a possible cause of this issue. Since you mentioned that you deal with high WebSocket traffic (which is long-running in nature), one possibility is that if you momentarily hit the maximum number of connections (the default in Bandit would be 100 * 16384), the retry mechanism in thousand_island keeps new connections stuck in a retry loop.

There is an easy way to rule this out: setting max_connections_retry_count: 0 disables this retry behaviour, so you can test whether that resolves the issue.

To set this in a Phoenix application, add :thousand_island_options to the :http (and, if applicable, :https) config of your endpoint in your config.exs, like the example below:

config :my_phoenix_app, MyPhoenixAppWeb.Endpoint,
  http: [
    ip: ...,
    port: ...,
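    # 0 disables the retry loop for connections arriving over max_connections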
    thousand_island_options: [max_connections_retry_count: 0]
  ],
  ...

I think we may have solved this in Thousand Island 1.3.3 (just released). Can you bump your Thousand Island dep and see if that solves things?
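
(If you don't already list it explicitly, something like the mix.exs entry below pins the minimum version; the Bandit constraint is just taken from the versions in this thread.)

# mix.exs -- Bandit pulls in thousand_island transitively, so an explicit
# entry only enforces the minimum version that includes the fix.
defp deps do
  [
    {:bandit, "~> 1.2"},
    {:thousand_island, "~> 1.3.3"}
  ]
end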

@alisinabh - thanks for the input. Based on our stats I think we should be well under this limit, since we're seeing about 3-4k ports open at peak. Since 1.3.3 is out now, I'll try that first, and if it still doesn't work I'll try max_connections_retry_count instead.

Looking forward to hearing if 1.3.3 fixes it!

Great news - I think that did it. We've now gone 24+ hours without incident on 1.3.3, so that seems to have fixed it. Thanks again @mtrudel for the quick response!

Great to hear!