facebook/proxygen

Issue with HTTP Thread Utilization and Request Backlog Under High Load

jadhavabhi opened this issue · 8 comments

Hi,

I am encountering an issue in my Proxygen-based HTTP server application, where I use Proxygen HTTP threads as worker threads and folly IO threads as consumer threads. The configuration uses the default HTTP server options, with a thread count of 240 for Proxygen and 384 for the IO worker threads.
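
For reference, the setup is wired up roughly like this (a simplified sketch: the handler factory, bind addresses, and start/stop handling are elided, and the names are only illustrative):

  #include <memory>
  #include <folly/executors/IOThreadPoolExecutor.h>
  #include <proxygen/httpserver/HTTPServer.h>

  int main() {
    proxygen::HTTPServerOptions options;   // otherwise default options
    options.threads = 240;                 // proxygen HTTP worker threads

    // Separate folly IO executor used as the consumer pool.
    auto consumerPool = std::make_shared<folly::IOThreadPoolExecutor>(384);

    proxygen::HTTPServer server(std::move(options));
    // server.bind(...); server.start();   // handler factory, addresses, etc. elided
    return 0;
  }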

Under high load, I observe that some HTTP threads are handling more than one request simultaneously, reaching up to 8 concurrent requests, while other HTTP threads are not fully utilized. This uneven load distribution is resulting in increased response times, as requests are placed in a wait state.

Additionally, I’ve noticed a significant backlog of gzip-compressed requests, and the load balancer seems to be directing more load to these specific threads.

Could you please advise on how to configure the system to limit the number of concurrent requests handled by each HTTP thread to a maximum of 3? Moreover, I’d like to prevent the accumulation of a specific type of request in the backlog.

Server configuration: 48 cores, Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
Any insights or suggestions would be greatly appreciated.

Thank you.

Hey!

Proxygen HTTP threads as worker threads and Folly IO threads as consumer threads. The configuration includes the default HTTP server options, with a thread count of 240 for Proxygen and 384 for the IO worker threads.

Can you please clarify what you mean here by "Proxygen HTTP worker threads" vs "Folly IO consumer threads"?

Under high load, I observe that some HTTP threads are handling more than one request simultaneously, reaching up to 8 concurrent requests, while other HTTP threads are not fully utilized. This uneven load distribution is resulting in increased response times, as requests are placed in a wait state.

HTTP connections, rather than individual requests, are routed to worker threads. If the HTTP connections are non-uniform (i.e. some connections are sending more requests/data than others), some imbalance may be expected.

Can you please clarify what you mean here by "Proxygen HTTP worker threads" vs "Folly IO consumer threads"?
By Proxygen HTTP worker threads I mean the thread count provided in options.threads; these act as producer threads for my application. On the EOM event, I do some processing such as parsing and validating the request, then submit it to the worker threads (the IO executor thread pool) and wait until I get the response back or the timer expires.
HTTP connections, rather than individual requests, are routed to worker threads. If the HTTP connections are non-uniform (i.e. some connections are sending more requests/data than others), some imbalance may be expected.
Is there any way to control the pending requests per connection when the thread has not completed the ongoing request? Would tuning the listen backlog help here?

Is there any way to control the pending requests per connection when the thread has not completed the ongoing request? Would tuning the listen backlog help here?

Yep, it mostly depends on the protocol:

  • http/1.1 by design only supports 1 concurrent request per connection
  • http/2 has a max_concurrent_streams setting
  • for http/3, the underlying transport protocol (QUIC in this case) controls the number of concurrent streams via the MAX_STREAMS frame

HTTPServerOptions has a maxConcurrentIncomingStreams member that should configure this accordingly based on the protocol being used.
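
For example, something along these lines should cap per-connection concurrency (a sketch; the exact effect depends on which protocol the connection negotiates):

  #include <proxygen/httpserver/HTTPServer.h>

  proxygen::HTTPServerOptions makeOptions() {
    proxygen::HTTPServerOptions options;
    // For h2 this is advertised to clients as SETTINGS_MAX_CONCURRENT_STREAMS;
    // http/1.1 is already limited to one request at a time per connection.
    options.maxConcurrentIncomingStreams = 3;
    return options;
  }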

configuration includes the default HTTP server options, with a thread count of 240 for Proxygen

That seems pretty high? Normally the number of HTTP workers shouldn't exceed the number of CPUs you have, as they are designed to be non-blocking. Reducing the thread count ought to smooth out your distribution some?
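
i.e. a sizing more like this (sketch):

  #include <thread>
  #include <proxygen/httpserver/HTTPServer.h>

  proxygen::HTTPServerOptions makeWorkerOptions() {
    proxygen::HTTPServerOptions options;
    // HTTP workers are meant to be non-blocking event loops, so one per core
    // is usually enough (48 on the machine described above).
    options.threads = std::thread::hardware_concurrency();
    return options;
  }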

backlog of gzip-compressed requests

This is a potential problem, since gzip compression is CPU intensive, but the default filters run on the IO threads. If compression is a bottleneck, you may want to uninstall the proxygen compression/decompression filters from the HTTPServer, and instead handle compression/decompression in your folly CPU bound thread pool. You can then tune the IO : CPU worker ratio and hopefully even out your core usage.
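
A rough sketch of the "compress on the CPU pool" half of that, with the built-in filter left disabled (header locations and namespaces vary a bit between folly versions, so treat this as a starting point rather than exact code):

  #include <folly/compression/Compression.h>
  #include <folly/io/IOBuf.h>

  // Intended to run on the CPU-bound thread pool, not on the HTTP/IO threads,
  // e.g. submitted via cpuThreadPool->add(...) or folly::via(...).
  std::unique_ptr<folly::IOBuf> gzipBody(const folly::IOBuf& body) {
    auto codec = folly::io::getCodec(folly::io::CodecType::GZIP);
    return codec->compress(&body);
  }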

Thanks @afrind and @hanidamlaj for your input.
I have revised my design to introduce an additional IO thread pool between the proxygen threads and the CPU thread pool, specifically for handling timers during request processing. The HTTP workers are now non-blocking, submitting jobs to the IO thread pool while passing the event base and downstream objects. This change has reduced the accumulation of requests; however, with 96 HTTP worker threads, the system is still not scaling efficiently, even though CPU utilization is only at 50-60% on a 96-core machine, and both the IO threads and CPU threads are within their maximum limits (IO threads at ~80% and CPU threads at ~50%).

I attempted to increase the number of HTTP worker threads to 192, which somewhat improved the distribution, but occasional hiccups in QPS still occur. It appears that requests are accumulating in the HTTP threads, with some threads handling more than 10 requests.

Additionally, I have not enabled the enableContentCompression setting under HTTP options, so I assume the HTTP layer is not performing compression or decompression tasks.

Could you suggest an effective way to tune the HTTP server threads or help identify the bottleneck causing request accumulation in the HTTP threads?

Hi @afrind and @hanidamlaj ,
My worker threads are processing requests at a slower rate of approximately 10 requests per second, while the HTTP threads are handling requests much faster due to their non-blocking nature. As a result, each HTTP thread (equal to the number of cores) is simultaneously handling around 500 requests. This imbalance may be causing starvation or delayed scheduling of my flush events on the event base, leading to a slowdown in throughput and timeouts. Is there a way to limit the number of requests per thread to prevent flush events from entering a starvation state?
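
For context, the only workaround I can think of so far is application-level load shedding along these lines (a sketch; ThrottlingHandler and the limit are placeholders, and the remaining RequestHandler overrides are omitted):

  #include <memory>
  #include <proxygen/httpserver/RequestHandler.h>
  #include <proxygen/httpserver/ResponseBuilder.h>
  #include <proxygen/lib/http/HTTPMessage.h>

  namespace {
  // One counter per HTTP worker thread: onRequest/requestComplete for a given
  // transaction both run on that worker's event base thread.
  thread_local int inFlightOnThisWorker = 0;
  constexpr int kMaxInFlightPerWorker = 3;  // hypothetical limit
  } // namespace

  class ThrottlingHandler : public proxygen::RequestHandler {
   public:
    void onRequest(std::unique_ptr<proxygen::HTTPMessage>) noexcept override {
      if (inFlightOnThisWorker >= kMaxInFlightPerWorker) {
        rejected_ = true;
        proxygen::ResponseBuilder(downstream_)
            .status(503, "Service Unavailable")
            .sendWithEOM();
        return;
      }
      ++inFlightOnThisWorker;
      // ...hand the request off to the IO/CPU pools as usual...
    }

    void requestComplete() noexcept override {
      if (!rejected_) {
        --inFlightOnThisWorker;
      }
      delete this;
    }

    // Remaining overrides (onBody, onEOM, onUpgrade, onError) omitted;
    // onError would need to decrement the counter too in real code.

   private:
    bool rejected_{false};
  };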

Hi @hanidamlaj ,
This is the gdb stack of the 'IOThreadPool0' thread at the time the process became unresponsive under high load and failed to recover. During this period, I observed a backlog of 1500 requests per HTTP thread, while all worker threads were in a waiting state. Please let me know if this information aids in diagnosing the issue.

Thread 1 (Thread 0xXXXXXX (LWP XXXXXXX) "IOThreadPool0"):
#0  0xXXXXXX in epoll_wait (epfd=XXXXX, events=0xXXXXXXXXXX, maxevents=32, timeout=1000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0xXXXXXX in epoll_dispatch () from /usr/local/lib/libevent-2.1.so.7
#2  0xXXXXXX in event_base_loop () from /usr/local/lib/libevent-2.1.so.7
#3  0xXXXXXX in (anonymous namespace)::EventBaseBackend::eb_event_base_loop (flags=1, this=<optimized out>) at proxygen-2023.11.20.00/proxygen/_build/deps/folly/folly/io/async/EventBase.cpp:74
#4  folly::EventBase::loopMain (this=0xXXXXXXXXXX, flags=<optimized out>, ignoreKeepAlive=false) at proxygen-2023.11.20.00/proxygen/_build/deps/folly/folly/io/async/EventBase.cpp:425
#5  0xXXXXXX in folly::EventBase::loopBody (this=0xXXXXXXXXXX, flags=0, ignoreKeepAlive=<optimized out>) at proxygen-2023.11.20.00/proxygen/_build/deps/folly/folly/io/async/EventBase.cpp:356
#6  0xXXXXXX in folly::EventBase::loop (this=0xXXXXXXXXXX) at proxygen-2023.11.20.00/proxygen/_build/deps/folly/folly/io/async/EventBase.cpp:335
#7  0xXXXXXX in folly::EventBase::loopForever (this=0xXXXXXXXXXX) at proxygen-2023.11.20.00/proxygen/_build/deps/folly/folly/io/async/EventBase.cpp:566
#8  0xXXXXXX in folly::IOThreadPoolExecutor::threadRun (this=0xXXXXXXXXXX, thread=...) at proxygen-2023.11.20.00/proxygen/_build/deps/folly/folly/executors/IOThreadPoolExecutor.cpp:253
#9  0xXXXXXX in std::__invoke_impl<void, void (folly::ThreadPoolExecutor::*&)(std::shared_ptr<folly::ThreadPoolExecutor::Thread>), folly::ThreadPoolExecutor*&, std::shared_ptr<folly::ThreadPoolExecutor::Thread>&> (__f=<optimized out>, __t=<optimized out>, __f=<optimized out>, __t=<optimized out>) at /usr/include/c++/11/bits/invoke.h:74
#10 std::__invoke<void (folly::ThreadPoolExecutor::*&)(std::shared_ptr<folly::ThreadPoolExecutor::Thread>), folly::ThreadPoolExecutor*&, std::shared_ptr<folly::ThreadPoolExecutor::Thread>&> (__fn=<optimized out>) at /usr/include/c++/11/bits/invoke.h:96
#11 std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)>::__call<void, , 0ul, 1ul>(std::tuple<>&&, std::_Index_tuple<0ul, 1ul>) (__args=..., this=<optimized out>) at /usr/include/c++/11/functional:420
#12 std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)>::operator()<, void>() (this=<optimized out>) at /usr/include/c++/11/functional:503
#13 folly::detail::function::FunctionTraits<void ()>::callSmall<std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)> >(folly::detail::function::Data&) (p=...) at proxygen-2023.11.20.00/proxygen/_build/deps/folly/folly/Function.h:349
#14 0xXXXXXX in folly::detail::function::FunctionTraits<void ()>::operator()() (this=0xXXXXXXXXXX) at proxygen/_build/deps/include/folly/Function.h:378
#15 0xXXXXXX in folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}::operator()() (__closure=0xXXXXXXXXXX) at proxygen/_build/deps/include/folly/executors/thread_factory/NamedThreadFactory.h:40
#16 0xXXXXXX in std::__invoke_impl<void, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(std::__invoke_other, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}&&) (__f=...) at /usr/include/c++/11/bits/invoke.h:61
#17 0xXXXXXX in std::__invoke<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}&&) (__fn=...) at /usr/include/c++/11/bits/invoke.h:96
#18 0xXXXXXX in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0xXXXXXXXXXX) at /usr/include/c++/11/bits/std_thread.h:259
#19 0xXXXXXX in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> >::operator()() (this=0xXXXXXXXXXX) at /usr/include/c++/11/bits/std_thread.h:266
#20 0xXXXXXX in std::thread::_State_impl<std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> > >::_M_run() (this=0xXXXXXXXXXX) at /usr/include/c++/11/bits/std_thread.h:211
#21 0xXXXXXX in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#22 0xXXXXXX in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#23 0xXXXXXX in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81