ros2/rclcpp

Possible regression in rcl preshutdown callbacks - context invalid?

SteveMacenski opened this issue · 11 comments

Bug report

Required Info:

  • Operating System:
    • 24.04
  • Installation type:
    • ROS 2 Jazzy, Rolling
  • Version or commit hash:
    • 28.2.0-1noble.20240430.174609 (from docker)
  • DDS implementation:
    • N/A
  • Client library (if applicable):
    • rclcpp

Description

I've noticed in the Nav2 transition in both unit tests and now at the system level that there appears to be a regression in the handling of the context and the pre-shutdown callbacks starting with the transition to 24.04 and seen in both Jazzy and current rolling. There has been some extended discussion wrt the unit tests in the comments:

But I'm also now seeing it on shutdown of Nav2 at the system level now that we have Gazebo Harmonic working with Nav2. We run our rcl preshutdown callbacks to transition the nodes that may not have been properly transitioned to shutdown before Control+C was hit. This is executing as expected with logs:

[component_container_isolated-8] [INFO] [1717089793.781486696] [controller_server]: Running Nav2 LifecycleNode rcl preshutdown (controller_server)
[component_container_isolated-8] [INFO] [1717089793.781516996] [controller_server]: Destroying bond (controller_server) to lifecycle manager.
[component_container_isolated-8] [INFO] [1717089793.781521867] [smoother_server]: Running Nav2 LifecycleNode rcl preshutdown (smoother_server)
[component_container_isolated-8] [INFO] [1717089793.781526590] [smoother_server]: Destroying bond (smoother_server) to lifecycle manager.
...

But then also contain traces from rcl which seem related to publishers failing to do their thing because the context is invalid at the stage of pre-shutdown. We see only publisher failures when we have the system properly cycled down, but if we have the Nav2 lifecycle nodes still active so services are called to transition them, then we see the service eq as well.

[component_container_isolated-8] [INFO] [1717089793.874960887] [controller_server]: Destroying
[component_container_isolated-8] [ERROR] [1717089793.875015316] [controller_server]: Unable to start transition 5 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368
[component_container_isolated-8] [WARN] [1717089793.875026497] [rclcpp_lifecycle]: Shutdown error in destruction of LifecycleNode: final state(unconfigured)
...

I clipped the logs to show one server in particular, but this repeats for all instances.

This can be reproduced in Nav2 using the 24.04/Rolling + the simulator in ros-navigation/navigation2#3634. Drive around for a moment, stop, and hit control+C. We previously had a clean shutdown without crash. It can also be more minimally reproduced in the PR https://github.com/ros-navigation/navigation2/pull/4298/files TEST(WPTest, test_dynamic_parameters) test changes -- where we manually shut things down & reset the object internally to the test to avoid the need for the rcl pre-shutdown to do work, thereby bypassing it

I don't know if its related, but also point out this issue as well discovered which points to issues in spin() exiting cleanly when Control+C is hit so that rclcpp::shutdown or others can be called once the node is kicked out. It seems related, but I can file a separate ticket for it. Similarly to that @tonynajjar has another ticket on spin_some not working the first time its called in some cases.

@SteveMacenski

#2545 can fix that graceful shutdown case, and generated by #2528. (sorry...)

No worries, shit happens. Another good reason Nav2's around using all these bells and whistles as a canary in the coalmine. Only used up like a day on this so not bad at all in the scheme of things. Lovely that there's already a solution! Thanks @fujitatomoya

#2545 fixes LifecycleNode to avoid leaving in unknow state, if the object is destroyed. (originally this is issued #997).

But if the context is gracefully shutdown (e.g deferred signal handler via SIGINT, which is the exactly case here), the context is already shutdown, and that makes the lifecycle node dtor fails to publish the transition (lifecyclenode should be shutdown to avoid unknow state for sensors and devices but fails since context is not valid anymore). #2545 bypass this situation at this moment, shutdown is called only when the context is valid. this is tagged with TODO until we figure out how to address this condition.

#2545 CI is running now, once i have green light, i will merge it to rolling.

BTW: this worked fine those pre-24.04 Rolling (22.04 Rolling didn't exhibit this) so there was something that changed that caused this, it was fully functioning for ~a year prior. I think my understanding of your response is that there's a bit of a catch-22, so mentioning this as a "I think there should be a solution, no?"

@SteveMacenski i think there are 2 problems here,

[component_container_isolated-8] [ERROR] [1717089793.875015316] [controller_server]: Unable to start transition 5 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368
[component_container_isolated-8] [WARN] [1717089793.875026497] [rclcpp_lifecycle]: Shutdown error in destruction of LifecycleNode: final state(unconfigured)

I am sure these messages are generated by #2528 as explained #2547 (comment), and #2545 can fix it.

note: #2528 IS NOT merged to jazzy. (#2542 staging but merged), so we should not meet these message with jazzy.

24.04 and seen in both Jazzy and current rolling. There has been some extended discussion wrt the unit tests in the comments:

these 3 issues are with rmw_fastrtps, right? this looks the same problem with ros2/rmw_fastrtps#761

The second comment is definitely the same as this.

The first is perhaps? Its also another 'things failing on shutdown' example and you see similar logs with the pre-shutdown (the third is a summarized version of the same). I didn't assume it was a Fast-DDS specific problem, but 🤷

    [INFO] [1716243365.746525522] [costmap]: Running Nav2 LifecycleNode rcl preshutdown (costmap)
    [INFO] [1716243365.746611543] [costmap]: Destroying bond (costmap) to lifecycle manager.
    Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'

@SteveMacenski

#2545 has been merged to rolling, the following waning and error should not be printed even with graceful shutdown.

[component_container_isolated-8] [ERROR] [1717089793.875015316] [controller_server]: Unable to start transition 5 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368
[component_container_isolated-8] [WARN] [1717089793.875026497] [rclcpp_lifecycle]: Shutdown error in destruction of LifecycleNode: final state(unconfigured)

Great! Currently in the middle of other firefighting with getting Nav2 out for Jazzy, but I'll test this out once the nightly docker containers update and I'm between burning buildings. If you're confident that it addresses the issue, feel free to close

@SteveMacenski after #2545 is merged in rolling, are you guys still having the other issue?

    [INFO] [1716243365.746525522] [costmap]: Running Nav2 LifecycleNode rcl preshutdown (costmap)
    [INFO] [1716243365.746611543] [costmap]: Destroying bond (costmap) to lifecycle manager.
    Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'

@fujitatomoya sorry for the delay, the 24.04 & GZ migrations took longer than we expected in Nav2-ville. Is there a docker base container available that contains these updates? If so, I can test this tomorrow and let you know!