Tradias/asio-grpc

single threaded asio client falls in to infinite loop when integrated with asio grpc

RavikumarTulugu opened this issue · 19 comments

I am seeing a tight loop with below stack trace. The application is a boost asio single threaded app which is now integrated with asio grpc. The boost version is 1.81. I have followed the client example in the examples directory ( share-io-context-client.cpp ).
Please throw some light on what could possibly go wrong.

(gdb) bt
#0 0x00007fb21bf778e9 in __GI___libc_realloc (oldmem=0x557a957fd180, bytes=90) at ./malloc/malloc.c:3496
#1 0x00007fb21cd7d8e5 in gpr_realloc () from /lib/x86_64-linux-gnu/libgrpc.so.10
#2 0x00007fb21cd06fe5 in grpc_error_string(grpc_error*) () from /lib/x86_64-linux-gnu/libgrpc.so.10
#3 0x00007fb21cd60518 in ?? () from /lib/x86_64-linux-gnu/libgrpc.so.10
#4 0x00007fb21ce5015c in grpc_impl::CompletionQueue::AsyncNextInternal(void**, bool*, gpr_timespec) () from /lib/x86_64-linux-gnu/libgrpc++.so.1
#5 0x0000557a66c7cddd in grpc_impl::CompletionQueue::AsyncNext<gpr_timespec> (this=0x557a6766ddf0, tag=0x7fff16a2e388, ok=0x7fff16a2e390, deadline=...) at /usr/include/grpcpp/impl/codegen/completion_queue_impl.h:198
#6 0x0000557a66c7ccd0 in agrpc::b::detail::get_next_event (cq=0x557a6766ddf0, event=..., deadline=...) at opensource/base/asio-grpc-2.4.0/src/agrpc/detail/grpc_context_implementation.ipp:165
#7 0x0000557a66c7c6c4 in agrpc::b::detail::GrpcContextImplementation::handle_next_completion_queue_event (grpc_context=..., deadline=..., invoke=agrpc::b::detail::InvokeHandler::YES)
at opensource/base/asio-grpc-2.4.0/src/agrpc/detail/grpc_context_implementation.ipp:173
#8 0x0000557a66c7c4bd in agrpc::b::detail::GrpcContextImplementation::do_oneagrpc::b::detail::IsGrpcContextStoppedPredicate (grpc_context=..., deadline=..., invoke=agrpc::b::detail::InvokeHandler::YES, stop_predicate=...)
at opensource/base/asio-grpc-2.4.0/src/agrpc/detail/grpc_context_implementation.ipp:214
#9 0x0000557a66c7ba9a in agrpc::b::detail::GrpcContextDoOne::poll (grpc_context=..., deadline=...) at opensource/base/asio-grpc-2.4.0/src/agrpc/run.hpp:164
#10 0x0000557a66c7b6fa in agrpc::b::detail::run_impl<agrpc::b::detail::GrpcContextDoOne, agrpc::b::DefaultRunTraits, boost::asio::io_context, agrpc::b::detail::AreContextsStoppedConditionboost::asio::io_context > (grpc_context=...,
execution_context=..., stop_condition=...) at opensource/base/asio-grpc-2.4.0/src/agrpc/run.hpp:191
#11 0x0000557a66c7b5bd in agrpc::b::run<agrpc::b::DefaultRunTraits, boost::asio::io_context, agrpc::b::detail::AreContextsStoppedConditionboost::asio::io_context > (grpc_context=..., execution_context=..., stop_condition=...)
at opensource/base/asio-grpc-2.4.0/src/agrpc/run.hpp:214
#12 0x0000557a66c7b545 in agrpc::b::run<agrpc::b::DefaultRunTraits, boost::asio::io_context> (grpc_context=..., execution_context=...) at opensource/base/asio-grpc-2.4.0/src/agrpc/run.hpp:207
#13 0x0000557a668c11d2 in main::$_3::operator() (this=0x7fff16a2e718) at
#14 0x0000557a668c1185 in boost::asio::detail::binder0main::$_3::operator() (this=0x7fff16a2e718) at opensource/base/boost_1_81_0/boost/asio/detail/bind_handler.hpp:60
#15 0x0000557a668c1165 in boost::asio::asio_handler_invoke<boost::asio::detail::binder0main::$_3 > (function=...) at opensource/base/boost_1_81_0/boost/asio/handler_invoke_hook.hpp:88
#16 0x0000557a668c113f in boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder0main::$_3, main::$_3> (function=..., context=...) at opensource/base/boost_1_81_0/boost/asio/detail/handler_invoke_helpers.hpp:54
#17 0x0000557a668c110d in boost::asio::detail::asio_handler_invoke<boost::asio::detail::binder0main::$_3, main::$_3> (function=..., this_handler=0x7fff16a2e718) at opensource/base/boost_1_81_0/boost/asio/detail/bind_handler.hpp:111
#18 0x0000557a668c0ffd in boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder0main::$_3, boost::asio::detail::binder0main::$_3 > (function=..., context=...)
at opensource/base/boost_1_81_0/boost/asio/detail/handler_invoke_helpers.hpp:54
#19 0x0000557a668c12d3 in boost::asio::detail::executor_op<boost::asio::detail::binder0main::$_3, std::allocator, boost::asio::detail::scheduler_operation>::do_complete (owner=0x557a67661f00, base=0x557a97e2c740)
at opensource/base/boost_1_81_0/boost/asio/detail/executor_op.hpp:70
#20 0x0000557a6697293e in boost::asio::detail::scheduler_operation::complete (this=0x557a97e2c740, owner=0x557a67661f00, ec=..., bytes_transferred=0) at opensource/base/boost_1_81_0/boost/asio/detail/scheduler_operation.hpp:40
#21 0x0000557a669722a7 in boost::asio::detail::scheduler::do_run_one (this=0x557a67661f00, lock=..., this_thread=..., ec=...) at opensource/base/boost_1_81_0/boost/asio/detail/impl/scheduler.ipp:492
#22 0x0000557a66971d77 in boost::asio::detail::scheduler::run (this=0x557a67661f00, ec=...) at opensource/base/boost_1_81_0/boost/asio/detail/impl/scheduler.ipp:210
#23 0x0000557a66954e9e in boost::asio::io_context::run (this=0x7fff16a2ff78) at opensource/base/boost_1_81_0/boost/asio/impl/io_context.ipp:63
#24 0x0000557a668bf69c in main (argc=40, argv=0x7fff16a33c18)

I assume that by infinite loop you mean that "the application blocks forever".

The usual flow of an asio application is that you submit some work to an execution context and then run() it. In the example we add some work to the io_context and then we run it. The call to run() will block until all work is done.
If you would like asio and asio-grpc to work on a background thread then you can simply do this. Note that you probably want to create a asio::make_work_guard(io_context) to prevent run() from returning early in that case. And the end of your program you simply release the work guards and it will run to completion cleanly.

i am sorry for being vague, infinite loop in the sense the application is busy polling and traversing the mentioned code path when there is no work. I have not added work guard , let me get back after adding the work guard. The application main loop is a io_context and the asio grpc is in a post lambda like below.

boost::asio::post( ioc,
[&] {
ioc.get_executor().on_work_finished();
agrpc::run ( grpc_context, ioc );
ioc.get_executor().on_work_started();
});

I understand. This behavior is expected. It is the unfortunate side-effect of trying to run two event loops on the same thread. It should however not consume any notice amount of CPU. If it does then you could adjust the latency. Let me know if you do and what value works for you:

struct RunTraits
{
    /**
     * @brief The desired maximum latency
     *
     * The maximum latency between consecutive polls of the execution context.
     */
    static constexpr std::chrono::microseconds MAX_LATENCY{500};
};

agrpc::run<RunTraits>(...);

Btw, this is also the reason why this function is still considered experimental.

You may also find this article from the documentation helpful: https://tradias.github.io/asio-grpc/md_doc_using_asio_io_context.html

I am confused now, so i still need to add work_guard ?? pls clarify.
The default latency is not working for me, the core utilization is shooting 100% , this started the investigation in first place.

Work tracking and the related work_guard are features coming from Asio itself. Asio-grpc supports them in exactly the same way. If want to learn a bit more about this concept you might want to check the Asio documentation, for example this tutorial:

The asio library provides a guarantee that completion handlers will only be called from threads that are currently calling io_context::run(). Therefore unless the io_context::run() function is called the completion handler for the asynchronous wait completion will never be invoked.

The io_context::run() function will also continue to run while there is still "work" to do. In this example, the work is the asynchronous wait on the timer, so the call will not return until the timer has expired and the completion handler has returned.

It is important to remember to give the io_context some work to do before calling io_context::run(). For example, if we had omitted the above call to steady_timer::async_wait(), the io_context would not have had any work to do, and consequently io_context::run() would have returned immediately.

Regarding the CPU consumption: Ok, here you can either adjust the MAX_LATENCY of the RunTraits or see whether the Implicit io_context or running io_context and grpc_context on different threads would suit your needs.

I tried out these things as you suggested , but the issue still persists.
added work guard to the agrpc context as per the client example.
changed the MAX_LATENCY from default 250 micoseconds to 1 millisecond.

The core util comes down to normal from 100 when i comment out the

boost::asio::post( ioc,
[&] {
ioc.get_executor().on_work_finished();
agrpc::run ( grpc_context, ioc );
ioc.get_executor().on_work_started();
});

This is just done to isolate the root cause of core shoot up . i will add a thread as recommended and will see if it helps. let me test out the examples and see if i see the same.

Ok, maybe I wasn't 100% clear. The work guard does not affect the CPU utilization caused by agrpc::run in any way. It is only influenced by the MAX_LATENCY. Here you can try to increase the latency like you did or change the threading model. For example to an explicit two-thread model:

asio::io_context io_context{1};
agrpc::GrpcContext grpc_context;

//asio::ip::tcp::socket socket(io_context);
// use socket etc.

std::thread t{[&] { io_context.run(); }};
grpc_context.run();

Or to an implicit two-thread model:

agrpc::GrpcContext grpc_context;

//asio::ip::tcp::socket socket(grpc_context);
// use socket etc.

grpc_context.run();

I think it might help to see a more complete example:

struct RunTraits
{
    static constexpr std::chrono::milliseconds MAX_LATENCY{1};
};

void start(auto& io_context, agrpc::GrpcContext& grpc_context, const asio::ip::port_type tcp_port, auto& stub,
           auto& grpc_context_work_guard)
{
    asio::co_spawn(
        io_context,
        [&, tcp_port]() -> asio::awaitable<void>
        {
            // The two operations below will run concurrently on the same thread.
            using namespace asio::experimental::awaitable_operators;
            co_await (make_grpc_request(grpc_context, stub) && make_tcp_request(tcp_port));
            grpc_context_work_guard.reset();
        },
        asio::detached);
}

int main(int argc, const char** argv)
{
    const auto grpc_port = argc >= 2 ? argv[1] : "50051";
    const auto host = std::string("localhost:") + grpc_port;
    const auto tcp_port = static_cast<asio::ip::port_type>(argc >= 3 ? std::stoul(argv[2]) : 8001);

    example::v1::Example::Stub stub{grpc::CreateChannel(host, grpc::InsecureChannelCredentials())};
    agrpc::GrpcContext grpc_context;
    auto grpc_context_work_guard = asio::make_work_guard(grpc_context);

    // --- Chose one of the following three models:

    // Implicit two-thread model:
    start(grpc_context, grpc_context, tcp_port, stub, grpc_context_work_guard);
    grpc_context.run();

    // Explicit two-thread model:
    asio::io_context io_context{1};
    start(io_context, grpc_context, tcp_port, stub, grpc_context_work_guard);
    std::thread t{[&]
                  {
                      io_context.run();
                  }};
    grpc_context.run();
    t.join();

    // Single-thread model. Adjust MAX_LATENCY as needed:
    asio::io_context io_context{1};
    start(io_context, grpc_context, tcp_port, stub, grpc_context_work_guard);
    asio::post(io_context,
               [&]
               {
                   io_context.get_executor().on_work_finished();
                   agrpc::run<RunTraits>(grpc_context, io_context);
                   io_context.get_executor().on_work_started();
               });
    io_context.run();
}

able to reproduce issue with the example code as well, my usage might be wrong, please take a look and correct.
on client side making io_context wait will trigger the issue making the client to shoot to 100 % core.
just add this one line on both client and server to make io_context loop forever.

on client and sever :

  • std::unique_ptr<boost::asio::io_context::work> work ( new boost::asio::io_context::work ( io_context ) );
    

on server:

in addition to the above make the coroutine loop forever in a while ( 1 )
asio::awaitable handle_grpc_request(agrpc::GrpcContext& grpc_context, example::v1::Example::AsyncService& service)
{
grpc::ServerContext server_context;
example::v1::Request request;
while ( 1 ) { // <------------ while loop added to make it stay.
grpc::ServerAsyncResponseWriterexample::v1::Response writer{&server_context};
if (!co_await agrpc::request(&example::v1::Example::AsyncService::RequestUnary, service, server_context, request,
writer, asio::bind_executor(grpc_context, asio::use_awaitable)))
{
co_return;
}
example::v1::Response response;
response.set_integer(request.integer());
co_await agrpc::finish(writer, response, grpc::Status::OK, asio::bind_executor(grpc_context, asio::use_awaitable));
}
}

once we run both the cilent and server with above modifications , we can see client shooting to 100% core.
I will play with the code you pasted and get back.

I managed to reproduce the issue on my Linux x86_64 machine. The problem is that in your case the grpc_context has already stopped, yet agrpc::run relies on it doing the "sleep" for MAX_LATENCY. I think this can be considered a bug in asio-grpc. For an immediate workaround, you can modify the single-thread model to:

    asio::post(io_context,
               [&]
               {
                   agrpc::run(grpc_context, io_context,
                              [&]
                              {
                                  return grpc_context.is_stopped();
                              });
               });
    io_context.run();

Thanks for making me aware of this issue.

Thank you for confirmation let me test and get back.

Is it possible to prioritize the fix ?? the quick fix didnt help much. ... Thanks

What do you mean by the workaround didn't help what. What exactly did it not help with? In my tests it does not show 100% CPU consumption anymore and the rest of the application behaves as expected.

In my case the 100% util was still there, just to be clear , it was gone in the example app but still i could see it in our app. let me dig more deep and get back. i will paste the stack trace later in the day.

I double checked , here is the stack trace of the code. The agrpc::run() is commented and the cpu hike disappears.

stack trace :
Thread 1 (Thread 0x7f727c571ac0 (LWP 458363) ):
#0 __GI___libc_malloc (bytes=bytes@entry=300) at ./malloc/malloc.c:3299
#1 0x00007f727c6e06a2 in __GI__IO_str_overflow (fp=0x7ffd250ad050, c=58) at ./libio/strops.c:106
#2 0x00007f727c6deeb4 in __GI__IO_default_xsputn (n=, data=, f=) at ./libio/genops.c:399
#3 __GI__IO_default_xsputn (f=0x7ffd250ad050, data=, n=347) at ./libio/genops.c:370
#4 0x00007f727c6c808c in outstring_func (done=30, length=347, string=0x556ecb40d1f0 "{"created":"@1678112971.528365277","description":"pollset_work","file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":324,"referenced_errors":[{"created":"@1678112971.528365073","description":"I"..., s=0x7ffd250ad050) at ../libio/libioP.h:947
#5 __vfprintf_internal (s=s@entry=0x7ffd250ad050, format=format@entry=0x7f727d5426d8 "Completion queue next failed: %s", ap=ap@entry=0x7ffd250ad1a0, mode_flags=mode_flags@entry=2) at ./stdio-common/vfprintf-internal.c:1517
#6 0x00007f727c6d915a in __vasprintf_internal (result_ptr=0x7ffd250ad198, format=0x7f727d5426d8 "Completion queue next failed: %s", args=0x7ffd250ad1a0, mode_flags=2) at ./libio/vasprintf.c:57
#7 0x00007f727d4fcbaf in gpr_log () from /lib/x86_64-linux-gnu/libgrpc.so.10
#8 0x00007f727d4df53a in ?? () from /lib/x86_64-linux-gnu/libgrpc.so.10
#9 0x00007f727d5cf15c in grpc_impl::CompletionQueue::AsyncNextInternal(void**, bool*, gpr_timespec) () from /lib/x86_64-linux-gnu/libgrpc++.so.1
#10 0x0000556e9c9c5f1d in grpc_impl::CompletionQueue::AsyncNext<gpr_timespec> (this=0x556e9d273df0, tag=0x7ffd250ad4a8, ok=0x7ffd250ad4b0, deadline=...) at /usr/include/grpcpp/impl/codegen/completion_queue_impl.h:198
#11 0x0000556e9c9c5e10 in agrpc::b::detail::get_next_event (cq=0x556e9d273df0, event=..., deadline=...) at opensource/base/asio-grpc-2.4.0/src/agrpc/detail/grpc_context_implementation.ipp:165
#12 0x0000556e9c9c5804 in agrpc::b::detail::GrpcContextImplementation::handle_next_completion_queue_event (grpc_context=..., deadline=..., invoke=agrpc::b::detail::InvokeHandler::YES) at opensource/base/asio-grpc-2.4.0/src/agrpc/detail/grpc_context_implementation.ipp:173
#13 0x0000556e9c9c55fd in agrpc::b::detail::GrpcContextImplementation::do_oneagrpc::b::detail::IsGrpcContextStoppedPredicate (grpc_context=..., deadline=..., invoke=agrpc::b::detail::InvokeHandler::YES, stop_predicate=...) at opensource/base/asio-grpc-2.4.0/src/agrpc/detail/grpc_context_implementation.ipp:214
#14 0x0000556e9c9c4c5a in agrpc::b::detail::GrpcContextDoOne::poll (grpc_context=..., deadline=...) at opensource/base/asio-grpc-2.4.0/src/agrpc/run.hpp:165
#15 0x0000556e9c60a427 in agrpc::b::detail::run_impl<agrpc::b::detail::GrpcContextDoOne, agrpc::b::DefaultRunTraits, boost::asio::io_context, main::$_3::operator()() const::{lambda()#1}>(agrpc::b::GrpcContext&, boost::asio::io_context&, main::$_3::operator()() const::{lambda()#1}) (grpc_context=..., execution_context=..., stop_condition=...) at opensource/base/asio-grpc-2.4.0/src/agrpc/run.hpp:192
#16 0x0000556e9c60a2dd in agrpc::b::run<agrpc::b::DefaultRunTraits, boost::asio::io_context, main::$_3::operator()() const::{lambda()#1}>(agrpc::b::GrpcContext&, boost::asio::io_context&, main::$_3::operator()() const::{lambda()#1}) (grpc_context=..., execution_context=..., stop_condition=...) at opensource/base/asio-grpc-2.4.0/src/agrpc/run.hpp:215
#17 0x0000556e9c60a2a7 in main::$_3::operator() (this=0x7ffd250ad7d8) at wsvr.cpp:619
#18 0x0000556e9c60a275 in boost::asio::detail::binder0main::$_3::operator() (this=0x7ffd250ad7d8) at opensource/base/boost_1_81_0/boost/asio/detail/bind_handler.hpp:60
#19 0x0000556e9c60a255 in boost::asio::asio_handler_invoke<boost::asio::detail::binder0main::$_3 > (function=...) at opensource/base/boost_1_81_0/boost/asio/handler_invoke_hook.hpp:88
#20 0x0000556e9c60a22f in boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder0main::$_3, main::$_3> (function=..., context=...) at opensource/base/boost_1_81_0/boost/asio/detail/handler_invoke_helpers.hpp:54
#21 0x0000556e9c60a1fd in boost::asio::detail::asio_handler_invoke<boost::asio::detail::binder0main::$_3, main::$_3> (function=..., this_handler=0x7ffd250ad7d8) at opensource/base/boost_1_81_0/boost/asio/detail/bind_handler.hpp:111
#22 0x0000556e9c60a0ed in boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder0main::$_3, boost::asio::detail::binder0main::$_3 > (function=..., context=...) at opensource/base/boost_1_81_0/boost/asio/detail/handler_invoke_helpers.hpp:54
#23 0x0000556e9c60a5f3 in boost::asio::detail::executor_op<boost::asio::detail::binder0main::$_3, std::allocator, boost::asio::detail::scheduler_operation>::do_complete (owner=0x556e9d267f00, base=0x556ecb3f3f50) at opensource/base/boost_1_81_0/boost/asio/detail/executor_op.hpp:70
#24 0x0000556e9c6bbe5e in boost::asio::detail::scheduler_operation::complete (this=0x556ecb3f3f50, owner=0x556e9d267f00, ec=..., bytes_transferred=0) at opensource/base/boost_1_81_0/boost/asio/detail/scheduler_operation.hpp:40
#25 0x0000556e9c6bb7c7 in boost::asio::detail::scheduler::do_run_one (this=0x556e9d267f00, lock=..., this_thread=..., ec=...) at opensource/base/boost_1_81_0/boost/asio/detail/impl/scheduler.ipp:492
#26 0x0000556e9c6bb297 in boost::asio::detail::scheduler::run (this=0x556e9d267f00, ec=...) at opensource/base/boost_1_81_0/boost/asio/detail/impl/scheduler.ipp:210
#27 0x0000556e9c69e3fe in boost::asio::io_context::run (this=0x7ffd250aef88) at opensource/base/boost_1_81_0/boost/asio/impl/io_context.ipp:63
#28 0x0000556e9c6087d1 in main (argc=40, argv=0x7ffd250b2c28) at ...

Ok, in that case I could use a larger code snippets from your app so that I can reproduce the problem. Or alternatively, you could try to add code from your app to the example until you get the 100% CPU consumption.

The stacktrace is nice but it is only a moment in time. Of course the CPU consumption is caused by a tight loop inside agrpc::run. What might be interesting is to look at the state of the execution contexts at this point: https://github.com/Tradias/asio-grpc/blob/v2.4.0/src/agrpc/run.hpp#L191. How much work do they have, are they stopped/shutdown, etc.

Totally agreed !!! thanks for the pointers, let me dig a bit deep and get back.

Infinite loop no longer seen after the new post handler.
asio::post(io_context, [&] {
agrpc::run(grpc_context, io_context, [&] {
return grpc_context.is_stopped();
});
});