facebook/fbthrift

The bug in folly/fbthrift interaction with libevent 2.*.

alex-700 opened this issue · 4 comments

Hello, fbthrift team!

There is serious bug in using INTERNAL libevent flag in folly:EventBase and fbthrift:IOWorkerContext.

Simple reproducer

Simplest "return constant version" server responds with delay (correlated with folly::detail::MemoryIdler::defaultIdleTimeout).

Code:
// versionable.thrift
namespace cpp thrift_test

service Versionable {
  string version();
}
# CMakeLists.txt
include(ThriftLibrary.cmake)

set(THRIFT1 <...>)
set(THRIFTCPP2 FBThrift::thriftcpp2)

thrift_library(
  versionable                  # file_name
  Versionable                  # services
  cpp2                         # language
  terse_writes,stack_arguments # options
  ${CMAKE_CURRENT_SOURCE_DIR}  # file_path
  ${CMAKE_CURRENT_BINARY_DIR}  # output_path
  ${CMAKE_CURRENT_BINARY_DIR}  # include_prefix
)
target_link_libraries(versionable-cpp2-obj FBThrift::thriftcpp2)

include_directories(${CMAKE_CURRENT_BINARY_DIR})
add_executable(thrift-test-server server.cpp)
target_link_libraries(thrift-test-server versionable-cpp2 FBThrift::thriftcpp2)
add_executable(thrift-test-client client.cpp)
target_link_libraries(thrift-test-client versionable-cpp2 FBThrift::thriftcpp2)
// server.cpp
#include <thrift/lib/cpp2/server/ThriftServer.h>

#include <folly/init/Init.h>
#include <glog/logging.h>

#include "gen-cpp2/Versionable.h"

struct versionable_service_t : thrift_test::cpp2::VersionableSvIf
{
   folly::Future<::std::string> future_version() override { return {"dev"}; }
};

DEFINE_uint32(memory_idler_timeout, 5, "Memory idler default timeout");

int main(int argc, char ** argv)
{
   FLAGS_logtostderr = true;
   folly::init(&argc, &argv);
   folly::detail::MemoryIdler::defaultIdleTimeout.store(std::chrono::seconds(FLAGS_memory_idler_timeout));
   apache::thrift::ThriftServer server;
   server.setProcessorFactory(
      std::make_shared<apache::thrift::ThriftServerAsyncProcessorFactory<versionable_service_t>>(
         std::make_shared<versionable_service_t>()));
   server.setIdleTimeout(std::chrono::milliseconds::zero());
   server.setPort(7777);
   server.serve();
}
// client.cpp
#include <thrift/lib/cpp2/async/HeaderClientChannel.h>

#include <folly/init/Init.h>
#include <glog/logging.h>

#include "gen-cpp2/VersionableAsyncClient.h"

int main(int argc, char ** argv)
{
   FLAGS_logtostderr = true;
   folly::init(&argc, &argv);

   folly::EventBase evb;
   thrift_test::cpp2::VersionableAsyncClient client(
      apache::thrift::HeaderClientChannel::newChannel(folly::AsyncSocket::newSocket(&evb, "127.0.0.1", 7777)));

   for (size_t i = 0; i < 10; ++i)
   {
      std::string response;
      client.sync_version(response);
      VLOG(0) << response;
   }
}
Execution:

image

This server [while being idled] is responding slowly delayed with half-normal distribution near MemoryIdler::defaultTimeout.

The reason of this behavior.

When server is waiting on event base loop, it can't wake up on putMessageInReplyQueue, because the reply queue was registered here in internal mode. Internal mode for this queue means that we can't go out from loop via EVLOOP_ONCE by event (and executed callback) on this queue, because n is zero. And the only event (non-internal), which can wake the event loop up, in my example, is timer event for MemoryIdler.

Thoughts

  • Explicit event_base_loopbreak from messageAvailable was removed in commit. But the following removed comment is still true.
    // In libevent2, internal events do not break the loop.
    // Most users would expect loop(), followed by runInEventBaseThread(),
    // to break the loop and check if it should exit or not.
    // To have similar bejaviour to libevent1.4, tell the loop to break here.
    // Note that loop() may still continue to loop, but it will also check the
    // stop_ flag as well as runInLoop callbacks, etc.
  • The call of startConsumingInternal [kind of] violates the contract of the function described here. This queue is managed outside the EventBase. For instance, it prevents switch INTERNAL mode for this queue inside EventBase here for keepAlive mode.
  • It seems the main purpose of using INTERNAL in folly is fast exit path by "no event registered". But INTERNAL also affects another exit-path, which folly seems to ignore now. I think that if second exit-path will work as if "there is no INTERNAL events" then this problem will gone (see "Possible solution").
  • There is AsyncTimeout class which can use INTERNAL logic too (from it public constructor). It can generate the similar bug: when INTERNAL timer shoot, libevent callbacks will be executed, but event_base_loop does not have to go out (and does not have to transfer control to folly::EventBase callbacks).

Possible solution

Actually, event_base_loopbreak-ish from first of "Thoughts" seems needed here, but maybe with other implementation. Hard breaking with event_base_loopbreak go out from libevent loop almost immediately, and don't give a chance to execute other libevent callbacks (big amounts of switch (libevent callbacks vs folly::EventBase callbacks) => worse performance).

For example, this problem will gone if somebody guarantees that after every group of libevent callbacks on INTERNAL events in event loop iteration any [non-internal] libevent callback will be register to call (it can be zero-timer (which register need-to-execute callback immediately)). This idea removes the problem of second exit-path, because every time when internal callback is executed (some "stoptoken callback" will be generated) and we exit event_base_loop at most on the next iteration.

Specific implementation of it can be the following. Call new EventBase::onInternalCallbackEnd from the end of libeventCallback in the case of INTERNAL. In this new EventBase::onInternalCallbackEnd register zero-timer event (do not register if it was already registered on this iteration of loop).

Hi!

I just tried to reproduce it on our internal build. Specifically what I did was that I completely removed all the MemoryIdler code from IOThreadPoolExecutor (to make sure that there are no spurious EventBase wakeups triggered by it) and then ran all the Thrift unit tests - observing no failures. I suspect that there may be some problem in EventBase compatibility with your libevent version. Could you please try the same experiment with unit tests + disabled MemoryIdler and let me know if you are seeing deadlocks/failures?

The internal/external events only matter wrt .loop() vs .loopForever(). loop() keeps looping the EventBase only while there are any registered external events, loopForever() keeps looping the EventBase until terminateLoopSoon() is called. IOThreadPoolExecutor is using loopForever() (https://github.com/facebook/folly/blob/a25c6a2cfc675c4e62621cfa5756ce429b38cd85/folly/executors/IOThreadPoolExecutor.cpp#L250) so internal/external shouldn't make any difference.

Thanks for the quick reply!

I suspect that there may be some problem in EventBase compatibility with your libevent version.

Most of my description is a theoretic description of the problem with the links on the current main branches of libevent/folly/fbthrift. So I'm pretty sure that my example can be reproduced on main branches. I have checked it for all of 2021.07.22.00, 2022.03.07.00 and 2022.03.14.00 of folly/wangle/fizz/fbthrift and libevent 2.1.12.

Could you please try the same experiment with unit tests + disabled MemoryIdler and let me know if you are seeing deadlocks/failures?

It is not easy. I think that with high probability fbthrift unit-tests can't catch this problem, because it has kinda higher level than module (unit). And the main (or last) villain here is IOWorkerContext (namely startConsumingInternal) which has no any unit-tests on.

The internal/external events only matter wrt .loop() vs .loopForever().

It can be true only for the queue managed by EventBase. It explicitly switches the internal/external mode of event here. But any other queues (e.g. one from IOWorkerContext) know nothing about this switch and manage internality by itself.

loop() keeps looping the EventBase only while there are any registered external events, loopForever() keeps looping the EventBase until terminateLoopSoon() is called.

Yeap, this part of a contract is not violated. But there are two different [interleaving] "looping" in the [e.g.] loop(). First one is inner libevent loop, and the second one is outer folly::EventBase loop. The second one calls the first one with option EVLOOP_ONCE, and hopes that it will return after execution of any callback. But this loop does not return if callbacks on INTERNAL event will executed only (n == 0, because only callbacks on non-INTERNAL events are considered). And it makes outer loop starve (the next callback from loopcallbacks_ can't be executed). And outer loop will wait an execution of any callback on non-INTERNAL event (in my example, it is timeout from MemoryIdler, but it does not matter).

let me draw your attention to this part of my previous comment.

It seems the main purpose of using INTERNAL in folly is fast exit path by "no event registered". But INTERNAL also affects another exit-path, which folly seems to ignore now.

internality gives you not only earlier exit in the case of "no external events", but also make executed callback on INTERNAL event be ignored for exit via EVLOOP_ONCE.

I tried running your repro (the only thing I changed was using a different service interface) and it still doesn't fail on our build. The key difference is that we use libevent 1.4.14b internally.

I think that with high probability fbthrift unit-tests can't catch this problem

We have a number of unit tests that start a thrift server (via ScopedServerInterfaceThread), so I was suspecting that those would catch the problem.

let me draw your attention to this part of my previous comment.

Thanks for flagging this! I re-checked the EventBase logic and it indeed has special handling for the internal event mode of its own queue (https://github.com/facebook/folly/blob/main/folly/io/async/EventBase.cpp#L436-L438). That said, in this particular case IOThreadPoolExecutor is using loopForever(), so EventBase's own queue will be registered as an external event and so event_base_loop should not return early.

The key difference is that we use libevent 1.4.14b internally.

This is very important note! In this version internal callbacks are not ignored for EVLOOP_ONCE, so this bug will not be reproduced. I thought that you use libevent 2.*, because manifest declares it. Is looking into this manifests wrong to determine the version?

Did I get it right that fbthrift/folly does not support libevent 2.*?

P.S. libevent version is so important for this discussion, so I add the version to the title.