ActionClient Crash: Taking data from action client but nothing is ready
jmachowinski opened this issue ยท 29 comments
Bug report
Required Info:
- Operating System:
- Ubuntu 22.04
- Installation type:
- binaries
- Version or commit hash:
- iron
- DDS implementation:
- Fast-RTPS
- Client library (if applicable):
- rclcpp
Steps to reproduce issue
Start action server
- server shall accept goal, but not terminate it
Start action client
- send goal
wait
Expected behavior
no crash
Actual behavior
after some time we get the exception
terminate called after throwing an instance of 'std::runtime_error' what(): Taking data from action client but nothing is ready
Additional information
Using a multi threaded spinner, but the default callback group, I guess this one is mutual exclusive
Could you prepare a minimal example that one could build and run to reproduce the error? It would also provide clarity on how various callbacks are defined.
I tried to create a minimal example, and figured out, that the cause of this issue was most likely due to deadlock issues in the client or the server. I can not reproduce the issue as original described in the minimal example.
However I managed reproduce the error now with a slightly different setup.
https://github.com/jmachowinski/rclcppActionDeadlock branch nothingReady
start server
start the client
strg+c the client
directly restart the client
Server:
terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
what(): goal_handle attempted invalid transition from state EXECUTING with event CANCELED, at ./src/rcl_action/goal_handle.c:95
Aborted (core dumped)
client:
terminate called after throwing an instance of 'std::runtime_error'
what(): Taking data from action client but nothing is ready
Aborted (core dumped)
start the client
start the server
client just runs
server directly crashes with
./server
terminate called after throwing an instance of 'std::runtime_error'
what(): Taking data from action server but nothing is ready
Aborted (core dumped)
Note, you need to do this multiple times, about 1 time out of 10 the error will trigger.
Okay, I had a look at the code now and it it pretty obvious that a race condition is ongoing.
In order to fix this, I would need some background information on
rcl_action_server_wait_set_get_entities_ready
or repectitive
rcl_wait
From the documentation of this function, it is unclear to me if it may return multiple events to be ready at once.
@wjwwood You seem to have worked a lot on this code, do you know the answer ?
I dug further into this issue, and from my tests I can see, that rcl_action_server_wait_set_get_entities_ready may return multiple events at one time.
Now I am a bit more lost on how to correctly map this to the waitable interface, as is_ready can not signal that multiple events are ready. The only way I see currently is to keep track of incoming events, as reported events through the waitable interface, which is somewhat cumbersome. Any ideas ?
I can reproduce the following exception on server side.
terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
what(): goal_handle attempted invalid transition from state EXECUTING with event CANCELED, at ./src/rcl_action/goal_handle.c:95
Aborted (core dumped)
i believe this is because canceled
is called on goal_handler
in the handle_cancel_
user callback. (cancel request is not accepted yet, that saying the goal is NOT canceling state yet.) after all, rcl_action
detects this as invalid transition as designed, https://design.ros2.org/articles/actions.html.
i think what expected here is, user callback should decide if accepted or not only, then process canceling sequence in the application then call canceled
on that goal_handler
.
btw, the following patch is required to build with rolling
.
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 59c6de8..4c509bc 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -25,5 +25,9 @@ add_executable(client src/Client.cpp)
ament_target_dependencies(client PUBLIC rclcpp rclcpp_action)
target_link_libraries(client PRIVATE ${cpp_typesupport_target})
+install(TARGETS
+ server
+ client
+ DESTINATION lib/${PROJECT_NAME})
ament_package()
diff --git a/package.xml b/package.xml
index 6b65995..f865ad0 100644
--- a/package.xml
+++ b/package.xml
@@ -9,6 +9,9 @@
<buildtool_depend>ament_cmake</buildtool_depend>
+ <depend>rclcpp</depend>
+ <depend>rclcpp_action</depend>
+
<test_depend>ament_lint_auto</test_depend>
<test_depend>ament_lint_common</test_depend>
<buildtool_depend>rosidl_default_generators</buildtool_depend>
gdb core stacktrace
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140540045682240) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140540045682240) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140540045682240, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007fd22dc85476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007fd22dc6b7f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x00007fd22df2dbbe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007fd22df3924c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007fd22df392b7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8 0x00007fd22df3923e in std::rethrow_exception(std::__exception_ptr::exception_ptr) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9 0x00007fd22ec7a8d9 in rclcpp::exceptions::throw_from_rcl_error (ret=2301, prefix="", error_state=0x0,
reset_error=0x7fd22e0ce794 <rcutils_reset_error>) at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp/src/rclcpp/exceptions/exceptions.cpp:90
#10 0x00007fd22f1cbc8a in rclcpp_action::ServerGoalHandleBase::_canceled (this=0x7fd1ec001130)
at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_action/src/server_goal_handle.cpp:96
#11 0x00005646ab15d534 in rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction>::canceled (this=0x7fd1ec001130,
result_msg=std::shared_ptr<deadlock::action::SomeAction_Result_<std::allocator<void> >> (use count 2, weak count 0) = {...})
at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_action/include/rclcpp_action/server_goal_handle.hpp:210
#12 0x00005646ab15b683 in ServerNode::cancelGoal (this=0x5646ac6fef60,
goalHandle=std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction>> (use count 3, weak count 1) = {...})
at /root/ros2_ws/colcon_ws/src/ros2/rclcppActionDeadlock/src/Server.cpp:59
#13 0x00005646ab1663a5 in std::__invoke_impl<rclcpp_action::CancelResponse, rclcpp_action::CancelResponse (ServerNode::*&)(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >), ServerNode*&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> > > (
__f=@0x5646aca6aca0: (rclcpp_action::CancelResponse (ServerNode::*)(ServerNode * const, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)) 0x5646ab15b3da <ServerNode::cancelGoal(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>,
__t=@0x5646aca6acb0: 0x5646ac6fef60) at /usr/include/c++/11/bits/invoke.h:74
#14 0x00005646ab165043 in std::__invoke<rclcpp_action::CancelResponse (ServerNode::*&)(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >), ServerNode*&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> > > (
__fn=@0x5646aca6aca0: (rclcpp_action::CancelResponse (ServerNode::*)(ServerNode * const, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)) 0x5646ab15b3da <ServerNode::cancelGoal(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>)
at /usr/include/c++/11/bits/invoke.h:96
#15 0x00005646ab163d10 in std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>::__call<rclcpp_action::CancelResponse, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&, 0ul, 1ul>(std::tuple<std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&>&&, std::_Index_tuple<0ul, 1ul>) (
this=0x5646aca6aca0, __args=...) at /usr/include/c++/11/functional:420
#16 0x00005646ab162fa0 in std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>::operator()<std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >, rclcpp_action::CancelResponse>(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&) (this=0x5646aca6aca0)
at /usr/include/c++/11/functional:503
#17 0x00005646ab162353 in std::__invoke_impl<rclcpp_action::CancelResponse, std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> > >(std::__invoke_other, std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&) (
__f=...) at /usr/include/c++/11/bits/invoke.h:61
#18 0x00005646ab160929 in std::__invoke_r<rclcpp_action::CancelResponse, std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> > >(std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&) (__fn=...)
--Type <RET> for more, q to quit, c to continue without paging--
at /usr/include/c++/11/bits/invoke.h:114
#19 0x00005646ab15e724 in std::_Function_handler<rclcpp_action::CancelResponse (std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >), std::_Bind<rclcpp_action::CancelResponse (ServerNode::*(ServerNode*, std::_Placeholder<1>))(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)> >::_M_invoke(std::_Any_data const&, std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >&&) (
__functor=..., __args#0=...) at /usr/include/c++/11/bits/std_function.h:290
#20 0x00005646ab16b335 in std::function<rclcpp_action::CancelResponse (std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >)>::operator()(std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction> >) const (this=0x5646ac84ba60,
__args#0=std::shared_ptr<rclcpp_action::ServerGoalHandle<deadlock::action::SomeAction>> (empty) = {...})
at /usr/include/c++/11/bits/std_function.h:590
#21 0x00005646ab169b66 in rclcpp_action::Server<deadlock::action::SomeAction>::call_handle_cancel_callback (this=0x5646ac84b9b0, uuid=...)
at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_action/include/rclcpp_action/server.hpp:439
#22 0x00007fd22f1b6fde in rclcpp_action::ServerBase::execute_cancel_request_received (this=0x5646ac84b9b0,
data=std::shared_ptr<void> (use count 2, weak count 0) = {...}) at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_action/src/server.cpp:459
#23 0x00007fd22f1b5be4 in rclcpp_action::ServerBase::execute (this=0x5646ac84b9b0, data=std::shared_ptr<void> (use count 2, weak count 0) = {...})
at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp_action/src/server.cpp:299
#24 0x00007fd22ec84aec in rclcpp::Executor::execute_any_executable (this=0x7fff63e3e250, any_exec=...)
at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:546
#25 0x00007fd22ecafc51 in rclcpp::executors::MultiThreadedExecutor::run (this=0x7fff63e3e250, this_thread_number=9)
at /root/ros2_ws/colcon_ws/src/ros2/rclcpp/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp:100
#26 0x00007fd22ecb1cb7 in std::__invoke_impl<void, void (rclcpp::executors::MultiThreadedExecutor::*&)(unsigned long), rclcpp::executors::MultiThreadedExecutor*&, unsigned long&> (
__f=@0x5646aca6da98: (void (rclcpp::executors::MultiThreadedExecutor::*)(rclcpp::executors::MultiThreadedExecutor * const, unsigned long)) 0x7fd22ecafad4 <rclcpp::executors::MultiThreadedExecutor::run(unsigned long)>, __t=@0x5646aca6dab0: 0x7fff63e3e250) at /usr/include/c++/11/bits/invoke.h:74
#27 0x00007fd22ecb1bc9 in std::__invoke<void (rclcpp::executors::MultiThreadedExecutor::*&)(unsigned long), rclcpp::executors::MultiThreadedExecutor*&, unsigned long&> (
__fn=@0x5646aca6da98: (void (rclcpp::executors::MultiThreadedExecutor::*)(rclcpp::executors::MultiThreadedExecutor * const, unsigned long)) 0x7fd22ecafad4 <rclcpp::executors::MultiThreadedExecutor::run(unsigned long)>) at /usr/include/c++/11/bits/invoke.h:96
#28 0x00007fd22ecb1ad2 in std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>::__call<void, , 0ul, 1ul>(std::tuple<>&&, std::_Index_tuple<0ul, 1ul>) (this=0x5646aca6da98, __args=...) at /usr/include/c++/11/functional:420
#29 0x00007fd22ecb1a15 in std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>::operator()<, void>() (this=0x5646aca6da98) at /usr/include/c++/11/functional:503
#30 0x00007fd22ecb19bc in std::__invoke_impl<void, std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>>(std::__invoke_other, std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>&&) (__f=...) at /usr/include/c++/11/bits/invoke.h:61
#31 0x00007fd22ecb1965 in std::__invoke<std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>>(std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)>&&) (__fn=...) at /usr/include/c++/11/bits/invoke.h:96
#32 0x00007fd22ecb1906 in std::thread::_Invoker<std::tuple<std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)> > >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x5646aca6da98) at /usr/include/c++/11/bits/std_thread.h:253
#33 0x00007fd22ecb18d6 in std::thread::_Invoker<std::tuple<std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)> > >::operator()() (this=0x5646aca6da98) at /usr/include/c++/11/bits/std_thread.h:260
#34 0x00007fd22ecb18b6 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::_Bind<void (rclcpp::executors::MultiThreadedExecutor::*(rclcpp::executors::MultiThreadedExecutor*, unsigned long))(unsigned long)> > > >::_M_run() (this=0x5646aca6da90) at /usr/include/c++/11/bits/std_thread.h:211
#35 0x00007fd22df672b3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#36 0x00007fd22dcd7b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#37 0x00007fd22dd69a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
@jmachowinski i think doc section is not correct, see #2266
@fujitatomoya I agree.
Can someone review #2250 ?
This fixes the 'Taking data from client' issues for us.
I think goal_handle attempted invalid transition
can be resolved with application code, and doc fix #2266.
This fixes the 'Taking data from client' issues for us.
on this one, i cannot reproduce this one, can you elaborate what the root case of this before starting review? that would be really appreciated to understand the problem 1st.
This problem is hard to reproduce, you need a certain workload and a actions running with a multithreaded spinner.
As for the explanation of what is going on, this is a long one...
First we need to look at the multithreaded executor:
rclcpp/rclcpp/src/rclcpp/executors/multi_threaded_executor.cpp
Lines 82 to 100 in e296583
The important point here is that get_next_executable is called with a lock and execute_any_executable(any_exec) is called without the lock.
Now we need to look at the function
rclcpp/rclcpp_action/src/client.cpp
Lines 266 to 274 in e296583
This function will be called by get_next_executable deep under the hood.
The important point here is that the kind of next executable is stored in the global pimpl.
Next is the function
rclcpp/rclcpp_action/src/client.cpp
Line 622 in e296583
This function is called by execute_any_executable. As pointed out above, this call is not protected any more by a lock, and
any other thread may have been calling ClientBase::is_ready in-between.
As one can see, the execute function uses the values in the global pimpl object to determine the kind of action that needs processing. This it the point were the race occurs, as the global pimpl may have been altered by another thread, leading to the "Executing action client but nothing is ready" exception.
if (pimpl_->is_feedback_ready) {
} else if (pimpl_->is_status_ready) {
} else if (pimpl_->is_goal_response_ready) {
} else if (pimpl_->is_result_response_ready) {
} else if (pimpl_->is_cancel_response_ready) {
} else {
throw std::runtime_error("Executing action client but nothing is ready");
}
Note, I also saw races between is_ready and take_data. Which are a mystery to me, as this should be protected by the mutex in the spinner. The patch also addresses these.
can you elaborate what the root case of this before starting review? that would be really appreciated to understand the problem 1st.
@fujitatomoya is the issue now clear to you ?
Seeing a simialr error on a code that was running fine under humble
Hi @jmachowinski , thank you for your digging and elaboration on the issue. We have been encountering it in our project since april, randomly in some 1/10 cases and seemingly dependent on the computer utilization. I am now testing our application with your fix and will let you know once I will be able to determine if it helps.
I hope this issue gets more attention here, dear ROS2 guys, it would really help to speed up the review of the related PR:)
@jmachowinski we ran into the same issue and it's kind of a blocker for us due to its quite consistent reproduction. Your explanation here was very useful to bootstrap the investigation we have started as well however there is one thing we cannot puzzle - originally you talked about runtime exception with a message "Taking data from action client but nothing is ready" (and we get exactly the same error) but your explanation leads to "Executing action client but nothing is ready" which seems to be different (at least we haven't seen that one yet). Does it mean that ones you fixed the race between is_ready()
and take_data()
you started to see execute()
call throwing that runtime exception?
@medvedevigorek Its a race between the tree function. is_ready() sets class members, which represent which data must be taken. In take_data() theses members are actually not cleared, as they are 'misused' to store the type of data that was taken. Later in execute the class members are used to figure out what kind of data was taken, and the class members are cleared. This leads to the three possible failures:
- is_ready might be called in between take_data and execute, which leads to a wrong cast of the taken data -> segfault
- is_ready signals data to be taken, execute intterupts and clears the member, take_data finds no member set and throws "Taking data from action client but nothing is ready" .
- is_ready is called, and a message is ready, take data is called, thread change before execute is called. Second thread jumps in, calls is_ready, and a second message is ready, take data is called. Execute is called (still in the second thread), the class member for the message is cleared. Thread 1 calls execute, and does not find any set class member and throws "Executing action client but nothing is ready"
Note that #2250 is a working fix
Thank you, @jmachowinski , for additional details. After getting the race condition caused by execute()
fixed I continued to experience "Taking data from action client.." runtime error and at this point it wouldn't be caused by any race condition as is_ready()
and take_data()
couldn't be called by different threads of the multi threaded executor. So I kept poking around and found the actual cause for it - take_data()
might get called multiple times for a single message in the queue.
Quick overview of involved functions:
get_next_executable()
first checks if there is a ready executable by callingget_next_ready_executable()
, if none then it callswait_for_work()
and runsget_next_ready_executable()
again.get_next_ready_executable()
gets a next waitable from the memory strategy and callstake_data()
on it. The memory strategy pops next waitable from the triggered waitables list which has an execution slot in its callback group - in case of mutually exclusive group no other callbacks are being executed at the moment.wait_for_work()
clears all handles in the memory strategy, collects all handles from registered nodes (this is when our action client is added), then updates the wait set and finally callsremove_null_handles()
on the memory strategy. The latter checks what waitables are ready by callingis_ready()
on each of them and moves the ready ones to the triggered waitables list.
Now what sequence leads to the multiple calls of take_data()
:
- The first call to
get_next_ready_executable()
returns nothing andwait_for_work()
is called. - It finds that an action client aka waitable has some data to process as its
is_ready()
returned true and it was moved to the triggered waitable list - The second call to
get_next_ready_executable()
doesn't return that waitable as something else is running in its callback group which is set to mutually exclusive mode or some other service was returned - The loop continues either in the same thread or a different thread - it doesn't matter
- Again the first
get_next_ready_executable()
is called and it returns nothing - our waitable is still blocked by the callback group and it's being kept in the triggered list. wait_for_work()
is called again and finds that our waitable is still ready hence it adds it again to the triggered list. Now we have two records for the same waitable.get_next_ready_executable()
finally returns the first waitable from the triggered waitable list which already gottake_data()
invoked.- In the same or another thread either step 5 or step 7 will get the second record of the waitable in question from the triggered list but this time
take_data()
will fail as it doesn't know what message type to read.
So the actual problem must be in AllocatorMemoryStrategy as it blindly adds ready waitables to the triggered list.
rclcpp/rclcpp/include/rclcpp/strategies/allocator_memory_strategy.hpp
Lines 123 to 127 in e296583
I ended up with a simple change to add a waitable only if it's not present in the triggered list and it fixed the problem for me. This is in addition to a change in execute()
method to not depend on class members for a message type.
I also tried your fix and works as well as it makes is_ready()
to read data in place and not in take_data()
anymore hence no duplicates in the triggered list possible.
Not sure what else might be impacted by the flow described above besides action client base - don't know what else uses Waitable interface in rclcpp.
@medvedevigorek Nice debugging. I agree, this a second bug we are hitting here.
I wonder, would not just clearing waitable_triggered_handles_ at the start of remove_null_handles fix this issue ?
@jmachowinski, I was thinking about clearing waitable_triggered_handles_
along with other handles in clear_handles()
as it's called by the executor in the beginning of wait_for_work()
routine. However I am not sure about the contract here as potentially the following is possible: collect_entities()
adds a waitable in the list, it's then moved to waitable_triggered_handles_
but not getting executed, on the next iteration the same waitable is not added into the list as it's not in a group anymore hence it's not executed while it was marked ready for execution on the previous iteration. I am not that familiar with the internals so cannot say for sure if such scenario is even a valid one or not. Something tells it's not but didn't want to have more issues so went with a check before adding a waitable instead of clearing the list. If someone familiar with the matter can confirm that then clearing the triggered list might be a better approach.
on the next iteration the same waitable is not added into the list as it's not in a group anymore
@medvedevigorek when you say this, do you mean it's not in a group (I assumed you mean callback group) anymore because it was deleted? Or for some other reason?
@wjwwood Its not added by collect_entities()
, as the waitable is at this point of time in execution, and therefore filtered out of the waitset.
See
@wjwwood yes, I meant callback group. collect_entities
collects waitables from callback groups and later a waitable can be moved to waitable_triggered_handles
list. And the question was if collect_entities
is called again is it possible to not have the same waitable collected again? If it's not possible then clearing the waitable_triggered_handles
might be okay.
Fixed by the merge of #2495 in jazzy
Thank you for your efforts in fixing this. Is there a possibility to backport it to humble? ๐
@alexmillane unfortunately i do not think we can do that, since #2495 is API/ABI break change.
Its pretty much straight forward to modify the patch to be non ABI/API breaking. @alexmillane feel free to do it ;-)
Looking at #2250 it seems this was applied to rolling/jazzy. Has anyone tried to apply to iron? Does the patch apply cleanly? Thanks!
PS. Or can the (patched) rolling rclcpp package be used with an iron distro?
@jmachowinski Could you eleborate how? I am willing to put in the effort into backporting this patch to iron as it basically makes it impossible to run any ros2control application for a longer period when using any controller with an action interface.
@jmachowinski Could you eleborate how? I am willing to put in the effort into backporting this patch to iron as it basically makes it impossible to run any ros2control application for a longer period when using any controller with an action interface.
Use the patch from #2495 and change it, so that the signature of
execute_goal_request_received(const std::shared_ptr<void> & data);
execute_cancel_request_received(const std::shared_ptr<void> & data);
execute_result_request_received(const std::shared_ptr<void> & data);
and the enum
EntityType
are not changed. Should be straight forward.
Hello @alexmillane and @doisyg, I just backported the fix from iron
branch to humble
and created the PR #2635