ros2/rclcpp

Waitset not triggering timer when initialized in lifecycle node

Opened this issue · 9 comments

Bug report

Required Info:

  • Operating System:
    • Ubuntu 24.04
  • Installation type:
    • Jazzy binaries (but reproducible from source, bisected to the commit linked below which introduced the issue)
  • Version or commit hash:
    • I bisected and found the exact commit that introduced this behavior (it was OK before): 5632a09
  • DDS implementation:
    • Cyclone
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

ros2 lifecycle set /minimal_timer configure
ros2 lifecycle set /minimal_timer activate
ros2 run examples_rclcpp_minimal_service service_main

Expected behavior

Since the service callback destroys only timer_, timer2_ should keep running and Hello, world! should keep printing.

Actual behavior

All timers are stopped and no Hello, world! is printed.

Additional information

The following might shed some light into this.

  • If the timer that is erronously not triggered is initialized in the constructor, rather than in a lifecycle transition, the issue does not arise.
  • If the action server is ran before the lifecycle transition (which means timer2_ is only created after timer_ is destroyed) the issue does not arise.
  • If I manually add a || true here making this condition always evaluate to true, all timers are executed, however I added some printing when timers are cleaned up / added and I can see that every iteration a timer gets added and removed, so something odd is going on there.

So I suspect there is something about how timers (and maybe other entities?) are added to wait sets inside lifecycle node transitions that creates this issue.

This definitely sounds like a bug.

If I manually add a || true here making this condition always evaluate to true,

This is going to force the waitset to always be rebuilt, which is not good for performance. It's likely that the bug is somewhere else (in how timers are added?)

I probably won't get a chance to look at this until after roscon, but thank you for the full repro example and detailed steps.

Yes, I agree that this sounds like a bug.

So I suspect there is something about how timers (and maybe other entities?) are added to wait sets inside lifecycle node transitions that creates this issue.

I wouldn't exclude that this is "timers-specific" since timers are handled in a slightly different ways from other entities.
Moreover, lifecycle nodes do not do anything special here, so the same problem should be reproducible in a regular node (executors don't really care about nodes, they only care about callback groups)

lifecycle nodes do not do anything special here

My assumption as well is that something about the execution order from lifecycle nodes is surfacing this, but it's not something that would be unique to lifecycle nodes.

Hi,

I discovered this bug five days ago at fmrico/cascade_lifecycle#14, and now I am solving a problem that is directly related at PlanSys2 (ros/rosdistro#43212 (comment)). I am solving this now by changing the SingleThreadedExecutors to EventsExecutors in some cases and removing the node before destroying it in other cases.

Basically, if I have two nodes in an executor, each one with a timer, if I destroy one of the nodes, the timer callback in the other node is no longer called.

I can investigate more if you want.

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/next-client-library-wg-meeting-friday-6th-december-2024-8am-pt/40954/2

I found the bug:

for (; ii < wait_set.size_of_timers(); ++ii) {
if (rcl_wait_set.timers[ii] != nullptr) {
ret = wait_set.timers(ii);

We got a mismatch here between the sizeof the rcl waitset and the rclcpp waitset. Therefore the ready timer in the rcl waitset is never checked. The fix is simple, but it is in exposed template code...

@clalancette Are we breaking ABI if we are patching the templates ?

My initial analysis was not correct. We got a mismatch between the index in the rcl_waitset and the index in the rclcpp waitset. Thats the bug.

@clalancette Are we breaking ABI if we are patching the templates ?

It depends on what we are changing, I think. What I'm going to say here is that we open the PR to fix things against rolling, and get that in. After that we can figure out what we need to do to backport it.

@wjwwood @mjcarroll This bug was introduced by the rewrite of the executor. It looks like there is a prune missing here :

There is a variable that indicates, that we need pruning, but it is never used. Also the interface is not available at this place...