Double spin required since 28.2.0

Question

Double spin required since 28.2.0

SteveMacenski opened this issue 5 months ago · 20 comments

SteveMacenski commented 5 months ago

Bug report

Required Info:

Operating System:
- Ubuntu 24.04 (Docker)
Installation type:
- Binaries
Version or commit hash:
- 28.3.1
DDS implementation:
- Fast-DDS, Cyclone DDS both
Client library (if applicable):
- rclcpp

Expected behavior

Spin All, Spin Some work as previously expected

Actual behavior

Spinning some or all requires a warmup spin that doesn't do anything before processing useful work.

Additional information

Nav2's CI broke recently with a unit test on a behavior tree node that I got to looking into yesterday. I was chatting with @clalancette about it in a comment thread and we realized that this is a rclcpp regression.

goal_updater_node spins an internal executor to process a callback function to get some data that has always worked until recently: https://github.com/ros-navigation/navigation2/blob/main/nav2_behavior_tree/plugins/decorator/goal_updater_node.cpp#L60. In debugging, I found that it wasn't getting the data the unit test was sending.

Blaming Executor::spin_some_impl(std::chrono::nanoseconds max_duration, bool exhaustive), I see there's been alot of poking around there recently in the last few months, and a PR to fix some other regression in May, which makes me think maybe there's another regression lurking in that corner.

Test A:

  callback_group_executor_.spin_all(std::chrono::milliseconds(200));

Pushing the 50ms up to 200ms still doesn't work

Test B:

  callback_group_executor_.spin_all(std::chrono::milliseconds(50));
  callback_group_executor_.spin_all(std::chrono::milliseconds(1));

This works

Test C:

   callback_group_executor_.spin_some(std::chrono::milliseconds(50));

Attempting to use spin some instead of spin all also fails

Test D:

Lets try ticking the node multiple times while having the single spin_all(50ms) (which would cause the spin to be called multiple times)

  tree_->rootNode()->executeTick();
  tree_->rootNode()->executeTick();
  tree_->rootNode()->executeTick();

This works

This experiment proves to me that something is odd with spinning behavior right now requiring multiple calls (whether sequentially or as part of the broader application loop) to process data. The first run is having no effect. I see this running in Nav2's CI that is running Cyclone and I can replicate this locally in a Docker container with Fast-DDS, so I think its a ROS 2 side issue, not DDS.

Answer 1 · 2024-07-25T17:47:13.000Z

Thanks for reporting, will take a look. Definitely sounds fishy.

Answer 2 · 2024-07-25T20:29:10.000Z

I wonder if this issue is related to the use of the notifier waitable?
If you add entities to a node that has already been added to an executor:

before: at the beginning of each spin cycle, the executor was looping over all the entities in the nodes and adding them to the waitset
now: adding new entities is done only on demand (when needed). the notifier waitable will appear as "ready" if something has changed, and executing them will refresh the list of entities

so with the new approach, when you call spin_* the only item of work ready is the waitable to refresh the entities (even if those entities already have work to do), calling spin a second time will now act on the updated waitset (which includes the new entities and will thus see their items of work)

The purpose of this change is that the previous behavior was a huge CPU bottleneck.

Answer 3 · 2024-07-25T20:32:08.000Z

IMO this "bug" is expected behavior with the current code(although it's a breaking change and we didn't advertise it...).

See how in the unit-tests we have a spin_all or multiple spin_once to "consume waitable events" before starting the actual test:
https://github.com/ros2/rclcpp/blob/rolling/rclcpp/test/rclcpp/executors/test_events_executor.cpp#L142-L171

The only ways that I can think of to fix it, consist of treating the notifier waitable as a "special entity", so that spin_some and spin_all go back to spinning if that's triggered, but I don't really like it.

Answer 4 · 2024-07-26T04:35:33.000Z

is expected behavior with the current code

That seems like potentially an oversight. For subscribers on regular data, its not a huge deal IMO to drop the first message while the system is 'warming up' (i.e. first lidar scan), but many subscriptions are the result of infrequent or irregular events (like button pushes) and missing the first of an event class is comparatively bad to be the standard ROS 2 out-of-the-box policy.

See how in the unit-tests we have a spin_all or multiple spin_once to "consume waitable events" before starting the actual test

I don't think that anything about these need to change, but it would be good to have a unit test dedicated to regressions on 'first-spinning' events. In effect, that's what we caught in Nav2

Answer 5 · 2024-07-26T06:32:03.000Z

No events are lost.
Indeed if using spin you won't notice any difference.

The only difference is that if you call spin_once or the other variations that collect events only once, the events may be processed in the following iteration if they are originated from a newly added entity (this assuming that you are calling spin_once in a loop).

The "warm up" spin is needed only for entities that are added after the node / cb group is added to the executor.
If multiple entities are added, a single spin iteration will setup them all.

In your example, if you create the subscription before adding the callback group to the executor you should solve the issue (for example moving this line to the end of the constructor https://github.com/ros-navigation/navigation2/blob/main/nav2_behavior_tree/plugins/decorator/goal_updater_node.cpp#L40)

Answer 6 · 2024-07-26T16:42:19.000Z

I am not sure how to fix this... but wanted to post the comment for the direction we take...

The "warm up" spin is needed only for entities that are added after the node / cb group is added to the executor.
If multiple entities are added, a single spin iteration will setup them all.

IMO this breaks the user application, huge behavior change... i really do not think requiring warmup spin before actual data processing with spin_xxx would be good user experience or API even with well-documented.

Answer 7 · 2024-07-26T17:46:34.000Z

This only happens when spin_xxx is not called in a loop AND entities have been created after the node / cb group was added to the executor AND the user has an expectation on some events to be processed.
IMO this is not a common situation, as spin_xxx should usually be called in a loop and the user doesn't have such expectations.

The problem affect mostly unit-tests where developers are trying to synchronize some operations: I do X, I spin and expect to get the message.
And as mentioned in the previous message, this is easily solvable in unit-tests such as the one posted by @SteveMacenski by simply adding the cb group to the executor AFTER the entity has been created.

The "issue" is that executors have internal events and calling spin_xxx only once may execute those events rather than the user-defined ones.

Answer 8 · 2024-07-26T17:50:59.000Z

After writing that message, I went to read the documentation of the spin_xxx functions, and maybe there is actually a bug.

spin_some

/// Collect work once and execute all available work, optionally within a max duration.

spin_all

/// Collect and execute work repeatedly within a duration or until no more work is available.

spin_once

/// Collect work once and execute the next available work, optionally within a duration.

spin_all should collect the work multiple times (and thus see the new entities, assuming that it's given enough spin duration).
The behavior I described above should only apply to those variants of spin that collect work once.

Answer 9 · 2024-07-26T18:00:22.000Z

No events are lost.

That is my verbal mistake. Events are not triggered when attempted to process the first time 👍

In your example, if you create the subscription before adding the callback group to the executor you should solve the issue (for example moving this line to the end of the constructor https://github.com/ros-navigation/navigation2/blob/main/nav2_behavior_tree/plugins/decorator/goal_updater_node.cpp#L40)

I could to do that for this case in Nav2, but I agree with the above comment from @fujitatomoya - this is not a good UX for ROS 2 users that would cause some head scratching and frustration (and feed the 'ROS sucks' trolls). That's a pretty subtle detail that doesn't seem like from an application developer's perspective should matter, but I can ack from a developer side why that could happen:

IMO this breaks the user application, huge behavior change... i really do not think requiring warmup spin before actual data processing with spin_xxx would be good user experience or API even with well-documented.

Answer 10 · 2024-07-26T18:21:53.000Z

I wrote a unit-test to exercise this behavior and run it across all executors... However:

if I do spin_once or spin_some, I need the warm-up as expected
spin_all works as expected and doesn't require the warm up as it collects the work multiple times.

The only way in which I made it fail was to do spin_all(std::chrono::microseconds(1)) which obviously runs out of time after executing the internal event so it won't even try to collect more work, so it's kind of equivalent to a spin_some

Answer 11 · 2024-07-31T20:02:45.000Z

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/next-client-library-wg-meeting-friday-2nd-august-2024/38881/1

Answer 12 · 2024-08-02T08:31:02.000Z

I guess we need to modify wait_for_work in the following way :

Check if current_notify_waitable_ is the only ready thing in the returned waitset
If yes, execute wait_for_work again with the rest of the timeout.

Alternative:
At start of wait for work:

Create a waitset only containing current_notify_waitable_, wait on it with timeout 0
execute current_notify_waitable_ if ready.

Second version sounds cleaner to me.

Answer 13 · 2024-08-02T08:38:18.000Z

Something like this :

  if(notify_waitable_)
  {
    // rcl::Wait
    executors::ExecutorNotifyWaitable::SharedPtr cpy = std::make_shared<executors::ExecutorNotifyWaitable>(*notify_waitable_);

    rclcpp::WaitSet tmp_wait_set_;
    tmp_wait_set_.add_waitable(cpy);
    auto res = tmp_wait_set_.wait(std::chrono::nanoseconds(0));
    if(res.kind() == WaitResultKind::Ready)
    {
        auto & rcl_wait_set = res.get_wait_set().get_rcl_wait_set();
        if (cpy->is_ready(rcl_wait_set)) {
          cpy->execute(cpy->take_data());
        }
    }
  }

I just tested this. This indeed fixes the bug, but some other tests fail because of it...

Answer 14 · 2024-08-05T00:15:39.000Z

I digged into this issue and I found the root cause.
I added a second test in #2591 that can successfully reproduce the problem.
The PR #2595 fixes the problem in the unit-test.

This bug affects the SingleThreadedExecutor and the MultiThreadedExecutor.
The EventsExecutor passes the test.

It doesn't matter how much duration you pass to spin_all, the test will always fail
spin_all doesn't find any ready work, there are two ready "waitables" but neither of them can be executed:
- the notify_waitable_ is not executed because it's not part of the current collection. This is expected.
- the current_notify_waitable_ is not executed because it doesn't have any callback group associate to it (i.e. the callback_group here is a nullptr https://github.com/ros2/rclcpp/blob/rolling/rclcpp/src/rclcpp/executor.cpp#L853). The notify waitable doesn't have any callback group because it's a waitable managed by the executor.
since we don't have any ready work, even if spin_all is supposed to be "exhaustive", it returns immediately without further collecting work.

The current usage of the NotifyWaitable in wait-set based executors requires to fully handle this waitable in a special way as it can't ever be executed in normal ways (besides the aforementioned check, we also enforce elsewhere that every executable must have a callback group https://github.com/ros2/rclcpp/blob/rolling/rclcpp/src/rclcpp/executor.cpp#L457-L459)

This problem can be mitigated if you have multiple entities in your executor, but this is not the case in the Nav2 unit-test, where the notify waitable is the only entity managed by the executor when the test starts (there's also the goal updated subscription but due to the ordering of operations this is not immediately known to the executor).
If the notify waitable is not the only entity with some work to do, then the "exhaustive" condition would have triggered a new wait_for_work.

This implies that the notify waitable is only executed explicitly here https://github.com/ros2/rclcpp/blob/rolling/rclcpp/src/rclcpp/executor.cpp#L744-L748, and never as part of a

get_next_executable()
execute_any_executable()

This lines happens at the end of a wait_for_work, but the update of the entity collection happens only at the beginning of wait_for_work.
That's why in these scenarios we need to call it twice.
For this reason, I think that @jmachowinski proposed fix makes sense, as it fixes the problem by fully handling the notify waitable in a single wait for work iteration.

Answer 15 · 2024-08-05T01:11:14.000Z

The TL;DR is that:

there was a bug in how the waitset-based executors spin_all used the notify waitable and this required a warm up iteration.
warm up iterations were expected with spin_some, but definitely not desirable.

@jmachowinski PR fixes both points for the waitset-based executors: #2595
The events executor does not have the spin_all bug, but has the spin_some limitation.
I opened a draft PR to fix that: #2597.

Nothing should be done for spin_once since this does not guarantee that the single entity that will be executed is the one expected by the user.

We should add a disclaimer about the spin_some limitation in the StaticSingleThreadedExecutor. We already fixed this thing in two different ways for two executors, I would like to not have to do it a third time, especially since this executor should be deprecated.

Answer 16 · 2024-08-15T03:17:27.000Z

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/next-client-library-wg-meeting-friday-16th-august-2024/39130/1

Answer 17 · 2024-08-29T15:48:51.000Z

Is there progress here?

Answer 18 · 2024-08-30T14:26:34.000Z

@SteveMacenski I discussed this issue further with @jmachowinski .

The events executor PR has some minor issues to fix before merging it
#2591

We understood the root cause of the problem for the waitset-baser executors.
The PR #2595 requires more work as the approach currently proposed in it , is not the correct way to fix.

I should have some time to work on both PRs next week.

Answer 19 · 2024-08-30T17:51:52.000Z

Ok, thanks!

Answer 20 · 2024-09-01T16:01:26.000Z

IMO this breaks the user application, huge behavior change... i really do not think requiring warmup spin before actual data processing with spin_xxx would be good user experience or API even with well-documented.

Fully agree here @fujitatomoya

I should have some time to work on both PRs next week.

Thanks, following closely as I am attempting to migrate our huge code base to Jazzy and we have a lot of spin_some outside loops. I don't want to start thinking about the implications if the behavior changes from Iron to Jazzy, so I will wait for this to be resolved.