Increased memory usage after switching from Cowboy to Bandit
Nilsonn opened this issue · 31 comments
Hey,
we recently migrated a Phoenix application from Cowboy to Bandit, following the recommendation of the Phoenix framework for future projects.
After the transition, we have observed an increase in memory consumption by our application. Previously, our application used approximately 300 MB of memory, but under Bandit we reach our peak at over 400 MB.
Correlating with the increased memory, we also see an increase in process count, peaking at approximately 200 more processes.
Used Bandit Versions: 1.2.0-1.2.2
Elixir Version: 1.16.0
Erlang: 26.2.1
OS: Ubuntu Jammy
We have not made any significant changes to our codebase other than the web server switch, and the increased memory usage is consistent across multiple deployments. Further, we did not get a lot more users.
Is the increase in memory and processes expected?
Let me know if I can provide anything else.
Best regards
Nils
What version of thousand island are you using?
What web traffic (http/websocket) and transport (http/https) are you using?
Thousand Island Version: 1.3.2
We have a few Phoenix LiveViews with websocket connections. But they are used very rarely.
The main traffic goes via HTTPS to our AWS Loadbalancer and from there via HTTP to the Phoenix server.
Try bumping thousand island to 1.3.5 and see if that resolves the issue; it's been the likeliest answer for most people in your situation.
If it does resolve, please report back so we get more data points on fixes! Thanks
Interesting! What are the markers on those charts?
Do you mean the rocket and the dots below the charts?
Those represent deployments. The rocket deployment was the update to Thousand Island 1.3.5 and Bandit 1.2.3.
And the two colored lines represent two nodes that are running.
Do you need any more information to understand this better?
@mtrudel Allow me to share the same experience as described by @Nilsonn.
We ran 3 experiments over the past couple of days (30min, 30min, 6 hours), where a new pod running Bandit was introduced alongside other pods running Cowboy.
I have some limitations on what I can share, but find below a comparison of the max memory usage percentage for the Bandit pod (solid colored line) against the min and max memory usage of all other pods (dotted gray lines).
Traffic is exclusively on HTTP/1 (not HTTPS) through an Envoy proxy.
elixir: 1.16.1-otp-25
erlang: 25.3.2.9
ubuntu: jammy-20240125
phoenix: 1.7.11
bandit: 1.2.3
thousand_island: 1.3.5
(Memory charts for the 30-minute, 30-minute, and 6-hour experiments.)
Sorry for the delayed response; I was on holiday last week.
@Nilsonn I wouldn't worry about the process count increase. It comes directly from the fact that Thousand Island maintains a separate handler supervisor process for each acceptor tree (see the README for details, or if you're REALLY curious, my EMPEX talk on exactly this). All of these extra processes are part of the supervision tree for each acceptor, and will scale linearly with the number of acceptor processes (which is constant), and not the number of client connections.
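For anyone wanting to tune that constant overhead, here's a minimal sketch (MyApp.Plug and the values shown are placeholders) of setting the acceptor count through Thousand Island's `num_acceptors` option:

```elixir
# Sketch only: MyApp.Plug and the values are placeholders. The number of extra
# supervisory processes scales with :num_acceptors (Thousand Island defaults
# to 100), not with the number of client connections.
children = [
  {Bandit,
   plug: MyApp.Plug,
   port: 4000,
   thousand_island_options: [num_acceptors: 100]}
]

Supervisor.start_link(children, strategy: :one_for_one)
```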
The memory usage is something that I've been hearing a lot more of as people are adopting the project, and I think it stems from how load balancers distribute traffic and manage connections to Bandit nodes.
Some data points that would help (from @Nilsonn, @ianko, and anyone else lurking here):
- What load balancer do you use (in particular, if AWS is it 'Network' or 'Application')
- What type of traffic (HTTP1/HTTP2/WS) and protocol (HTTP/HTTPS) dominates your use case
- Do you have any other metrics regarding BEAM process count / # of TCP sockets that you can share (ideally presented alongside memory growth so we can correlate)
- Are you willing/able to run a branch of Bandit to see how it affects the symptoms you see? This sort of thing usually requires a few back and forth iterations since it's the sort of thing that often only comes up in production
My current hunch (that I'm going to work to reproduce here at home) is this:
- Bandit uses a single process per connection, so multiple HTTP (and thus Plug) requests on a single TCP connection will use the same process. This is by design.
- In contrast, Cowboy uses a single process per connection, but spins out a separate process for each HTTP request that comes in on that connection.
- This means that Cowboy doesn't need to think about GC'ing between HTTP requests, since the request process only lasts the duration of a single request.
- Load balancers like to keep a pool of upstream connections. This means that we can have a potentially huge number of requests performed in each Bandit connection process, with no explicit GC performed in between them (see the sketch below).
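To make that hunch concrete, here's a minimal, hypothetical plug that logs which process handles each request; under Bandit, repeated keepalive requests on the same connection should report the same pid, whereas under Cowboy each request gets its own short-lived process:

```elixir
defmodule DemoPlug do
  # Illustrative only: logs the pid handling each request. With Bandit, several
  # keepalive requests over the same TCP connection report the same pid, so any
  # garbage they produce accumulates in that one process between GCs.
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    IO.inspect(self(), label: "handling request in")
    send_resp(conn, 200, "ok")
  end
end
```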
I'm working to get a repro of this together early this week
What load balancer do you use (in particular, if AWS is it 'Network' or 'Application')
We use an Application load balancer; Istio, to be more specific.
After reading the mtrudel/thousand_island#115 issue you just linked, it did ring a bell here. After migrating to Istio, we saw an increase in reconnections to Redis due to :tcp_closed events, using either eredis or redix.
So you are on point in suspecting the load balancer.
What type of traffic (HTTP1/HTTP2/WS) and protocol (HTTP/HTTPS) dominates your use case
In my tests, it was 100% HTTP/1 (no ssl).
Are you willing/able to run a branch of Bandit to see how it affects the symptoms you see? This sort of thing usually requires a few back and forth iterations since it's the sort of thing that often only comes up in production
Definitely. I will also try to run an experiment in a service without Istio.
Do you have any other metrics regarding BEAM process count / # of TCP sockets that you can share (ideally presented alongside memory growth so we can correlate)
This is what I can share for the 6h experiment we ran. The dotted lines are the min and max values for each metric for the pods running Cowboy, while the colored line is Bandit's max value for the same metric.
Those are super helpful @ianko! Right off the top, it looks like memory is the only thing that's out of control; the constant increase in process count is expected given Thousand Island's process model. Thousand Island does per-acceptor worker supervision to avoid contention, which results in a larger (but constant) number of supporting processes. As a result, I'd expect to see a similar process graph to Cowboy's, just shifted upwards by a small, constant amount (you didn't include the scale, but I'd guess the difference to be a small integer multiple of your num_acceptors value, which defaults to 100). Similarly, run queue and socket counts are as expected.
@ianko, could you give {:bandit, github: "mtrudel/bandit", branch: "gc_experiment"} a shot and see what happens to memory usage?
The format of your graph (with the dashed lines representing cowboy) is perfect! (where are you generating that, btw?)
The branch (https://github.com/mtrudel/bandit/tree/gc_experiment) adds a GC between every keepalive request (overkill, but a quick check to see if keepalive is the root cause).
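Not the branch's actual code, but a self-contained toy that shows the idea of forcing a collection after each request handled by a long-lived process (KeepaliveGcDemo and the 64 KB "response body" are purely illustrative):

```elixir
# Toy demonstration (not Bandit internals): simulate many "requests" handled by
# one long-lived process, forcing a collection after each one so binaries
# created for a request don't linger until the next natural GC.
defmodule KeepaliveGcDemo do
  def run(request_count \\ 1_000) do
    Enum.each(1..request_count, fn _ ->
      # Stand-in for per-request garbage (e.g. a response body).
      _response_body = :crypto.strong_rand_bytes(64 * 1024)
      # What the gc_experiment branch effectively does after every request.
      :erlang.garbage_collect()
    end)

    :erlang.process_info(self(), [:memory, :total_heap_size])
  end
end
```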
Will try for 30min again and report back.
The format of your graph (with the dashed lines representing cowboy) is perfect! (where are you generating that, btw?)
That's from Datadog. I run Bandit as an Argo Expriment pod and let Datadog aggregate the min and max on the other pods using cowboy.
That's from Datadog. I run Bandit as an Argo Expriment pod and let Datadog aggregate the min and max on the other pods using cowboy.
What a great product Datadog is!
Great news! I'm going to do a bit of cleanup on this and get it released ASAP, with a configurable option for gc_every_n_requests defaulting to 1.
Once this gets released, would you be able to test a couple of different values for that and see which works best? 1 is pretty extreme, and my hunch is that something a bit higher is probably a better tradeoff of some transient memory usage for improved inter-request latency. Again, one of those things you can only really suss out in production.
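For reference, a hypothetical sketch of how such an option might be set once released; the option name (gc_every_n_requests) and its placement under http_1_options follow the discussion above and may differ in the shipped version:

```elixir
# Hypothetical configuration sketch: the gc_every_n_requests name and its
# location under :http_1_options are assumptions based on this thread, not the
# released API. MyApp.Plug is a placeholder.
children = [
  {Bandit,
   plug: MyApp.Plug,
   port: 4000,
   http_1_options: [gc_every_n_requests: 1]}
]
```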
@ianko do you have a comparative graph of 'HTTP 99th latency' handy comparing the previous versions of Bandit to Cowboy? I'm trying to suss out if GC'ing between keepalives has an effect on response latency. Any other metrics you may have that would help compare the difference with Cowboy pre/post this new fix would be great as well.
@mtrudel I'm currently running a 6h experiment to compare against that other one shared previously.
A configurable option would be great. I guess if someone wants the current behavior, they would set gc_every_n_requests to zero, right?
would you be able to test a couple of different values for that and see which works best?
For sure!
do you have a comparative graph of 'HTTP 99th latency' handy comparing the previous versions of Bandit to Cowboy?
Here is the latency data from the experiments I shared previously.
30min (Bandit 1.2.3)
30min (Bandit 1.2.3)
This was one pod that got a request for a long-running query, which pulled the max metrics up.
6h (Bandit 1.2.3)
30min (Bandit gc_experiment)
I don't really see a whole lot of difference in latency between Bandit pre/post fix, do you agree?
I pushed a different approach to gc_experiment that uses the fullsweep_after option rather than manually GC'ing. Can you give it a shot?
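For context, this is roughly the mechanism (a sketch, not the branch's actual code): the fullsweep_after flag can be set on the connection-handling process so that every collection is a full-sweep one.

```elixir
# Sketch of the idea only: setting fullsweep_after to 0 on the
# connection-handling process makes every GC a full-sweep (major) collection,
# so old-heap data and off-heap binary references left over from previous
# keepalive requests are reclaimed promptly.
:erlang.process_flag(:fullsweep_after, 0)
```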
@mtrudel This is using the fullsweep_after: 0 approach. The most noticeable difference for me was the CPU usage. This latest change seems to be more CPU intensive, closer to Cowboy's usage (see table at the end).
AVG CPU (moving average over a span of 5):

| Before (manual GC) | After (fullsweep_after: 0) |
|---|---|
| (chart) | (chart) |
Fascinating. It makes sense that explicit GC and fullsweep_after: 0 are equally effective at solving the issue & have similar performance characteristics in everything except CPU, but full sweep being MORE CPU intensive is just wild; I'd have expected the exact opposite.
Just to compare apples to apples, the before and after charts above; are they on the same Y scale?
(Also, I can't thank you enough for the quality of these charts and your willingness to tinker; your work is making a really hard-to-repro issue a joy to work through).
Just to compare apples to apples, the before and after charts above; are they on the same Y scale?
Better comparison below. I've locked the Y-axis to the same scale on both graphs. Note that the "after" run had about 20% less traffic than the first run.
- Dotted lines are the min and max for Bandit.
- Colored lines are the moving average for Bandit (for smoother visualization and comparison).
- Dashed lines are the moving average for the regular pods using Cowboy.
Before - manual GC
After - fullsweep_after: 0
I can run a longer experiment again for better sampling.
Both solved the issue with no noticeable drawback (in my view). Using fullsweep_after, the CPU usage is basically the same as Cowboy's.
(Also, I can't thank you enough for the quality of these charts and your willingness to tinker; your work is making a really hard-to-repro issue a joy to work through).
No, thank you for the willingness to solve this so fast.
Yeah, that's pretty conclusive. I guess the explicit GC wins.
Still, explicit GC makes me nervous. I'm going to put a call out on the Elixir Forum to see if any VM wizards can weigh in on unforeseen side effects. I'm not going to dwell on it too long, but I'd love to have an expert chime in.
Barring disaster from said expert, I'm thinking we can merge & ship the explicit GC version of this maybe EOD tomorrow.
Since this is such a low-level BEAM question, did you want to cross-post it to the Erlang forum?
Based on the discussion on the forum thread, I think I'm ready to merge in the 'full GC' approach which I've pushed up as #322.
Looking through the P95 charts above, I think I do in fact see a negative impact of the GC approach on request latency. I'd like to make one last effort to improve this before merging.
@ianko, would you be able to run one more test of the PR branch? (Note the branch name has changed from the experiment above; it's now gc_on_keepalives.) I upped the interval to only GC every 5 requests by default; hopefully that's a good balance between memory usage & latency.
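For anyone following along, testing the PR branch is the same kind of one-line mix.exs change as the earlier experiment:

```elixir
# In mix.exs, temporarily point the dependency at the PR branch instead of the
# Hex release.
{:bandit, github: "mtrudel/bandit", branch: "gc_on_keepalives"}
```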
@mtrudel Ran an experiment for 1h. This time we had some abnormal spikes in the 99th latency. This is not related to this fix. Other pods running Cowboy eventually saw the same behavior.
My take on these metrics is that CPU usage is lower than with fullsweep_after: 0 and a little bit higher than with the first approach. Memory usage is still as steady as it can be.
Overall, I'm extremely satisfied with the solution. No penalty on latency, lower memory and CPU usage than Cowboy, and we will have the ability to configure it as needed.
Let me know if you want to see any metric or comparison in particular.
Those are great results; thanks @ianko! Agree that there's nothing holding this back from a merge & release. I'll get this out first thing in the morning.
Thanks again for all your help. You really made this easy.
Released as 1.3.0