Using multiple backends can lead to throughput loss
Closed this issue · 3 comments
As of 4864a6d, I see the following usage for inc-nl when running experiments/web:
The usage should be much higher (starting out much closer to the dotted line and gradually decreasing). If I change num_AA_backends to 1 then the issue goes away:
Not sure what the problem is. We might have too few connections, too many connections, a bottleneck in Envoy, or some issue with the Envoy config.
Looks like the problem is the number of connections Envoy is trying to establish. If we reduce the number of fortio servers (by removing a serve_port), things look better. A "correct" fix would probably be to set sensible limits on max_connections, max_pending_requests, and max_requests.
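For reference, those limits map onto Envoy's per-cluster circuit breaker thresholds. A minimal sketch of what that might look like (the cluster name and threshold values here are illustrative, not taken from our actual config):

```yaml
clusters:
- name: fortio_backends          # hypothetical cluster name
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_connections: 1024      # cap on upstream connections Envoy will open
      max_pending_requests: 256  # cap on requests queued waiting for a connection
      max_requests: 1024         # cap on concurrent requests (relevant for HTTP/2)
```

Requests beyond these limits fail fast instead of piling up connections, which is the behavior we'd want here.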
Since the problem is overload between the backends and Envoy, we can mitigate as noted in the previous comment. This is not impacting any current results.
Following up after seeing this again: the problem appears to be CPU starvation in Envoy. One thing that has helped is enabling HTTP/2 communication between clients and Envoy (Envoy already spoke HTTP/2 to the backends, so now all communication is cleartext HTTP/2).
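A rough sketch of the relevant Envoy config for cleartext HTTP/2 on both sides (listener/cluster names and the route config are illustrative placeholders, not our actual setup): the listener's HTTP connection manager sets codec_type to HTTP2 for the client side, and the cluster's HttpProtocolOptions selects HTTP/2 toward the backends.

```yaml
static_resources:
  listeners:
  - name: ingress                # hypothetical listener name
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress
          codec_type: HTTP2      # clients speak cleartext HTTP/2 (prior knowledge)
          route_config:
            virtual_hosts:
            - name: all
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: fortio_backends }
  clusters:
  - name: fortio_backends        # hypothetical cluster name
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}   # HTTP/2 to the backends
```

With HTTP/2 on the client side, many requests multiplex over one connection, which cuts the per-connection work Envoy does and helps with the CPU starvation.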
Overall it seems like we are just barely crossing the threshold where Envoy becomes CPU overloaded; if we stay slightly under it, things are OK.
The best fix would be to use bigger machines for Envoy. Unfortunately, only xl170 machines can use the user-allocatable switches on cloudlab, so we'd need to set up relay xl170 machines that NAT traffic over another network to the actual machines. However, connecting xl170 machines to two networks is blocked by a cloudlab issue.