ros2/rmw_cyclonedds

[Iron][nav2] error: Failed to find a free participant index for domain 0

doisyg opened this issue · 26 comments

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04
  • Installation type:
    • latest Iron binaries
  • DDS implementation:
    • rmw_cyclonedds_cpp

Steps to reproduce issue

With export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp, start the standard navigation stack

ros2 launch nav2_bringup navigation_launch.py
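
Combined, for a copy-pasteable reproduction in one shell session:

# set the RMW implementation, then start the standard navigation stack
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
ros2 launch nav2_bringup navigation_launch.py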

Expected behavior

Same as for Humble

Actual behavior

Multiple errors:

[INFO] [launch]: Default logging verbosity is set to INFO
[INFO] [controller_server-1]: process started with pid [2561141]
[INFO] [smoother_server-2]: process started with pid [2561143]
[INFO] [planner_server-3]: process started with pid [2561145]
[INFO] [behavior_server-4]: process started with pid [2561147]
[INFO] [bt_navigator-5]: process started with pid [2561149]
[INFO] [waypoint_follower-6]: process started with pid [2561151]
[INFO] [velocity_smoother-7]: process started with pid [2561153]
[INFO] [lifecycle_manager-8]: process started with pid [2561155]
[lifecycle_manager-8] 1685658158.898666 [0] lifecycle_: Failed to find a free participant index for domain 0
[lifecycle_manager-8] [ERROR] [1685658158.898760137] [rmw_cyclonedds_cpp]: rmw_create_node: failed to create domain, error Error
[lifecycle_manager-8]
[lifecycle_manager-8] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[lifecycle_manager-8] This error state is being overwritten:
[lifecycle_manager-8]
[lifecycle_manager-8] 'error not set, at ./src/rcl/node.c:263'
[lifecycle_manager-8]
[lifecycle_manager-8] with this new error message:
[lifecycle_manager-8]
[lifecycle_manager-8] 'rcl node's rmw handle is invalid, at ./src/rcl/node.c:421'
[lifecycle_manager-8]
[lifecycle_manager-8] rcutils_reset_error() should be called after error handling to avoid this.
[lifecycle_manager-8] <<<
[lifecycle_manager-8] [ERROR] [1685658158.898851751] [rcl]: Failed to fini publisher for node: 1
[lifecycle_manager-8] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
[lifecycle_manager-8] what(): failed to initialize rcl node: rcl node's rmw handle is invalid, at ./src/rcl/node.c:421
[velocity_smoother-7] [INFO] [1685658158.900652561] [velocity_smoother]:
[velocity_smoother-7] velocity_smoother lifecycle node launched.
[velocity_smoother-7] Waiting on external lifecycle transitions to activate
[velocity_smoother-7] See https://design.ros2.org/articles/node_lifecycle.html for more information.
[planner_server-3] 1685658158.901798 [0] planner_se: Failed to find a free participant index for domain 0
[planner_server-3] [ERROR] [1685658158.901891797] [rmw_cyclonedds_cpp]: rmw_create_node: failed to create domain, error Error
[planner_server-3]
[planner_server-3] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[planner_server-3] This error state is being overwritten:
[planner_server-3]
[planner_server-3] 'error not set, at ./src/rcl/node.c:263'
[planner_server-3]
[planner_server-3] with this new error message:
[planner_server-3]
[planner_server-3] 'rcl node's rmw handle is invalid, at ./src/rcl/node.c:421'
[planner_server-3]
[planner_server-3] rcutils_reset_error() should be called after error handling to avoid this.
[planner_server-3] <<<
[planner_server-3] [ERROR] [1685658158.901982797] [rcl]: Failed to fini publisher for node: 1
[planner_server-3] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
[planner_server-3] what(): failed to initialize rcl node: rcl node's rmw handle is invalid, at ./src/rcl/node.c:421
[controller_server-1] 1685658158.902240 [0] controller: Failed to find a free participant index for domain 0
[controller_server-1] [ERROR] [1685658158.902346717] [rmw_cyclonedds_cpp]: rmw_create_node: failed to create domain, error Error
[controller_server-1]
[controller_server-1] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[controller_server-1] This error state is being overwritten:
[controller_server-1]
[controller_server-1] 'error not set, at ./src/rcl/node.c:263'
[controller_server-1]
[controller_server-1] with this new error message:
[controller_server-1]
[controller_server-1] 'rcl node's rmw handle is invalid, at ./src/rcl/node.c:421'
[controller_server-1]
[controller_server-1] rcutils_reset_error() should be called after error handling to avoid this.
[controller_server-1] <<<
[controller_server-1] [ERROR] [1685658158.902473321] [rcl]: Failed to fini publisher for node: 1
[controller_server-1] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
[controller_server-1] what(): failed to initialize rcl node: rcl node's rmw handle is invalid, at ./src/rcl/node.c:421
[behavior_server-4] 1685658158.903089 [0] behavior_s: Failed to find a free participant index for domain 0
[behavior_server-4] [ERROR] [1685658158.903185626] [rmw_cyclonedds_cpp]: rmw_create_node: failed to create domain, error Error
[behavior_server-4]
[behavior_server-4] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[behavior_server-4] This error state is being overwritten:
[behavior_server-4]
[behavior_server-4] 'error not set, at ./src/rcl/node.c:263'
[behavior_server-4]
[behavior_server-4] with this new error message:
[behavior_server-4]
[behavior_server-4] 'rcl node's rmw handle is invalid, at ./src/rcl/node.c:421'
[behavior_server-4]
[behavior_server-4] rcutils_reset_error() should be called after error handling to avoid this.
[behavior_server-4] <<<
[behavior_server-4] [ERROR] [1685658158.903255114] [rcl]: Failed to fini publisher for node: 1
[behavior_server-4] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
[behavior_server-4] what(): failed to initialize rcl node: rcl node's rmw handle is invalid, at ./src/rcl/node.c:421
[waypoint_follower-6] [INFO] [1685658158.904875824] [waypoint_follower]:
[waypoint_follower-6] waypoint_follower lifecycle node launched.
[waypoint_follower-6] Waiting on external lifecycle transitions to activate
[waypoint_follower-6] See https://design.ros2.org/articles/node_lifecycle.html for more information.
[smoother_server-2] 1685658158.906080 [0] smoother_s: Failed to find a free participant index for domain 0
[waypoint_follower-6] [INFO] [1685658158.906139702] [waypoint_follower]: Creating
[smoother_server-2] [ERROR] [1685658158.906186368] [rmw_cyclonedds_cpp]: rmw_create_node: failed to create domain, error Error
[smoother_server-2]
[smoother_server-2] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[smoother_server-2] This error state is being overwritten:
[smoother_server-2]
[smoother_server-2] 'error not set, at ./src/rcl/node.c:263'
[smoother_server-2]
[smoother_server-2] with this new error message:
[smoother_server-2]
[smoother_server-2] 'rcl node's rmw handle is invalid, at ./src/rcl/node.c:421'
[smoother_server-2]
[smoother_server-2] rcutils_reset_error() should be called after error handling to avoid this.
[smoother_server-2] <<<
[smoother_server-2] [ERROR] [1685658158.906262923] [rcl]: Failed to fini publisher for node: 1
[smoother_server-2] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
[smoother_server-2] what(): failed to initialize rcl node: rcl node's rmw handle is invalid, at ./src/rcl/node.c:421
[bt_navigator-5] 1685658158.906819 [0] bt_navigat: Failed to find a free participant index for domain 0
[bt_navigator-5] [ERROR] [1685658158.906923740] [rmw_cyclonedds_cpp]: rmw_create_node: failed to create domain, error Error
[bt_navigator-5]
[bt_navigator-5] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[bt_navigator-5] This error state is being overwritten:
[bt_navigator-5]
[bt_navigator-5] 'error not set, at ./src/rcl/node.c:263'
[bt_navigator-5]
[bt_navigator-5] with this new error message:
[bt_navigator-5]
[bt_navigator-5] 'rcl node's rmw handle is invalid, at ./src/rcl/node.c:421'
[bt_navigator-5]
[bt_navigator-5] rcutils_reset_error() should be called after error handling to avoid this.
[bt_navigator-5] <<<
[bt_navigator-5] [ERROR] [1685658158.906995792] [rcl]: Failed to fini publisher for node: 1
[bt_navigator-5] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
[bt_navigator-5] what(): failed to initialize rcl node: rcl node's rmw handle is invalid, at ./src/rcl/node.c:421

Additional information

Seems related to https://stackoverflow.com/questions/74232474/ros2-galacitc-failed-to-find-a-free-participant-index-for-domain
The suggested workaround there appears to solve the issue:
export CYCLONEDDS_URI=${CYCLONEDDS_URI:-"<CycloneDDS><Discovery><ParticipantIndex>auto</ParticipantIndex><MaxAutoParticipantIndex>100</MaxAutoParticipantIndex></Discovery></CycloneDDS>"}
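
The override can also be scoped to a single launch instead of exported shell-wide; a usage sketch combining the same fragment with the launch command above:

# apply the override for this launch only, without touching the shell environment
CYCLONEDDS_URI="<CycloneDDS><Discovery><ParticipantIndex>auto</ParticipantIndex><MaxAutoParticipantIndex>100</MaxAutoParticipantIndex></Discovery></CycloneDDS>" ros2 launch nav2_bringup navigation_launch.py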

Historically, Cyclone's worked great for us out of the box, so it would be great to get that back!

The Cyclone version didn't change from Humble to Iron, but the RMW layer in Iron initializes the configuration a little differently because of the work that went into getting sensible default behaviours. That might affect things — in particular

const bool discovery_off =
obviously plays with these settings.

Whatever is in the environment variable gets appended to the bit that the RMW layer computes and therefore overrides the "RMW" defaults. Setting ParticipantIndex and MaxAutoParticipantIndex like that via CYCLONEDDS_URI should take care of it, but of course that particular command only sets the variable if it wasn't set before.
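
To spell out that last shell detail (standard POSIX parameter expansion, nothing Cyclone-specific):

# ${CYCLONEDDS_URI:-"..."} expands to the current value if CYCLONEDDS_URI is
# already set and non-empty, and to the quoted default only otherwise. So if
# the variable was already exported elsewhere (e.g. in ~/.bashrc), the export
# above changes nothing.
echo "${CYCLONEDDS_URI:-unset, the default would be used}"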

What I do find suspicious in the output is that it looks more like hitting Cyclone's default of 9 than like hitting the RMW's (raised) default of 32.

Doing CYCLONEDDS_URI="$CYCLONEDDS_URI,<Tracing><Verbosity>fine</Verbosity><OutputFile>cdds.log.\${CYCLONEDDS_PID}</OutputFile></Tracing>" should produce a bunch of log files, one per process. That would give us the details on the actual settings used, as well as on the number of participants. That might be worth checking.

Lots of log files (which I won't be able to interpret myself); I have enabled that variable only for our navigation stack launch, see attached:
cdds.log.zip
This was done with, in addition:

export ROS_AUTOMATIC_DISCOVERY_RANGE=LOCALHOST
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp

that I won't be able to interpret myself

I know ...

Anyway, there are two logs from processes that do start (the two big ones), and there we can see 31 processes and 31 port numbers (and it's the port numbers that matter here):

# sed -Ene 's;.*SPDP.*NEW.*udp/([^@]+).*__ProcessName...([^"]*).*__Pid...([^"]*).*;\1 \2 \3;p' cdds.log.2762943|sort  
192.168.229.75:7410 robot_state_publisher 2759180
192.168.229.75:7412 max_height_tf_node 2759184
192.168.229.75:7414 teleop_node 2759188
192.168.229.75:7416 teleop_node 2759190
192.168.229.75:7418 virtual_safety_system_node 2759194
192.168.229.75:7420 twist_mux 2759192
192.168.229.75:7422 velocity_smoother 2759200
192.168.229.75:7424 joy_node 2759186
192.168.229.75:7426 horizontal_3d_flattening_node 2759218
192.168.229.75:7428 lifecycle_manager 2759202
192.168.229.75:7430 lifecycle_manager 2759208
192.168.229.75:7432 horizontal_3d_flattening_node 2759238
192.168.229.75:7434 tower_dynamic_filtering_angle_node 2759214
192.168.229.75:7436 lifecycle_manager 2759198
192.168.229.75:7438 simple_move_node 2759204
192.168.229.75:7440 collision_monitor 2759196
192.168.229.75:7442 tower_dynamic_filtering_angle_node 2759216
192.168.229.75:7444 arri_undock_node 2759206
192.168.229.75:7446 pointclouds_to_merged_scan_node 2759270
192.168.229.75:7448 horizontal_3d_flattening_node 2759244
192.168.229.75:7450 horizontal_3d_flattening_node 2759236
192.168.229.75:7452 parameter_bridge 2759326
192.168.229.75:7454 wiferion_battery_state_from_sim_node 2759337
192.168.229.75:7456 dexory_sim_manager_node 2759329
192.168.229.75:7458 python3 2759212
192.168.229.75:7460 python3 2759182
192.168.229.75:7462 lifecycle_manager 2761014
192.168.229.75:7464 lifecycle_manager 2761018
192.168.229.75:7466 amcl 2761016
192.168.229.75:7468 map_server 2761012
192.168.229.75:7472 velocity_smoother 2762941
# sed -Ene 's;.*SPDP.*NEW.*udp/([^@]+).*__ProcessName...([^"]*).*__Pid...([^"]*).*;\1 \2 \3;p' cdds.log.2762943|sort|wc
      31      93    1489

Given your environment variable quote, one would expect 100.

# grep MaxAutoParticipant cdds.log.2762943
1685714825.615466 [0] lifecycle_: config: Domain/Discovery/MaxAutoParticipantIndex/#text: 32 {0}

So quite simply, it is using the value of 32 set at

config += "<MaxAutoParticipantIndex>32</MaxAutoParticipantIndex>";
and this is insufficient. (The {0} at the end means it found the value only in the first configuration fragment.)
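
The port numbers point the same way. With the standard DDSI well-known-ports mapping (a back-of-the-envelope check, not taken from the logs themselves), the unicast discovery ports encode the participant indices:

# DDSI defaults: PB=7400, DG=250, PG=2, d1=10, so for domain 0 the unicast
# discovery port is 7410 + 2*PI, i.e. PI = (port - 7410) / 2. The listing
# above runs from 7410 to 7472, i.e. participant indices 0 through 31:
for port in 7410 7468 7472; do
  echo "port $port -> participant index $(( (port - 7410) / 2 ))"
done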

At

config += std::string(config_from_env);
it appends the contents of the environment variable, so I suspect that for some reason or other the environment variable hasn't actually been set to raise it to 100. That's something I'd double check.
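
One way to double-check (values illustrative):

# confirm the variable really is exported in the environment the nodes inherit
export CYCLONEDDS_URI='<CycloneDDS><Discovery><MaxAutoParticipantIndex>100</MaxAutoParticipantIndex></Discovery></CycloneDDS>'
printenv CYCLONEDDS_URI
# With tracing enabled (see the <Tracing> fragment above), the trace should
# then report MaxAutoParticipantIndex as 100 rather than the 32 seen here,
# with a fragment marker other than plain {0}.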

Thanks for the detailed answer and investigation!

it appends the contents of the environment variable, so I suspect that for some reason or other the environment variable hasn't actually been set to raise it to 100. That's something I'd double check.

Is it safe to assume that in the meantime, using
export CYCLONEDDS_URI=${CYCLONEDDS_URI:-"<CycloneDDS><Discovery><ParticipantIndex>auto</ParticipantIndex><MaxAutoParticipantIndex>100</MaxAutoParticipantIndex></Discovery></CycloneDDS>"}
will ensure the exact same default behavior as in Humble (which I trust, having extensively tested it in production)?

Yes, that's a safe assumption.

The Cyclone version didn't change from Humble to Iron

FYI, this is not true. In Humble, we have CycloneDDS version 0.9.x, while in Iron we have 0.10.x. I don't know if that is the cause of this error, or if it comes from the rmw_cyclonedds layer, but it is different.
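
For anyone wanting to verify locally, the shipped version can be checked on the Ubuntu binaries; the package naming below follows the usual ros-<distro>-* scheme and is an assumption:

dpkg -l | grep -i cyclonedds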

Oops! Thanks @clalancette for correcting that. I now remember having known it ...

I don't think Cyclone changed in this particular detail from 0.9 to 0.10.

Are there any updates here?

Not really; I assumed the immediate problem had been dealt with easily enough using some configuration, and kinda forgot about it.

Yesterday I realized I might be able to do better in deciding whether to use multicast: there’s something fishy in Linux with multicast over loopback; it seems to work fine, but the interface doesn’t state that it supports it. What I could do is attempt a trial multicast to see if it works despite not being advertised. That should allow eliminating this problem.

That said, one has to be careful in using multicast. Quite some effort was spent on figuring out how to prevent unwanted discovery over the network in Iron, and I think everybody is happy with the results. I don’t want to break that … I do have an inkling of how to make it all play nicely, by getting some more information from the kernel for received packets, but that I definitely should try out first. I do think this is worth the effort.

The multicast issue on loopback is a separate one. I believe most of us now know how to work around it (ip link set lo multicast on). It would be nice not to have it, sure, but for now I would prefer to focus on having the exact same behavior as for Humble, i.e. a working setup without configuration (without export CYCLONEDDS_URI).

Makes sense. I didn't realise the behaviour change (likely) occurred with that ip link set lo multicast on workaround in place, but with that bit of information the puzzle pieces start to fit. With https://docs.ros.org/en/foxy/Releases/Release-Iron-Irwini.html#improved-discovery-options and the automatic discovery range set to "localhost", the Cyclone RMW layer in Humble would use multicast on the loopback interface (and so no limit on the number of participants), but in Iron it maps to unicast discovery at

disable_multicast = true;
while limiting the number of participants to 32.

IIRC, this change was made after quite some investigation that eventually led to the conclusion that this was the only way, within the existing configuration options, to work around packets somehow making it from a network interface X into a socket subscribed only to multicast packets on lo. (I find that weird behaviour of the kernel ...)

Bumping the number will work around the problem but, as noted in the comment, it is not ideal.

I intend to do a Cyclone release soonish anyway, and to update Rolling to that new release. So let me try to figure out whether I can change Cyclone in some way that solves this problem. Then we can worry about backporting it to 0.10.x for Iron, or consider moving Iron to that new release.

Great to know that it is being worked on! Thanks

afrixs commented

For those who come here looking for a solution to the new participant index issue: to preserve the previous behavior (the one not depending on participant indices), you need to manually set ParticipantIndex to none, e.g.

export CYCLONEDDS_URI='<CycloneDDS><Domain><Discovery><ParticipantIndex>none</ParticipantIndex></Discovery></Domain></CycloneDDS>'

Note: the multicast-disabling tag was not automatically added in my case for some reason, but you may also need to re-enable multicast, i.e. the whole config could look like:

export CYCLONEDDS_URI='<CycloneDDS><Domain><General><Interfaces><NetworkInterface name="lo"/></Interfaces><AllowMulticast>true</AllowMulticast></General><Discovery><ParticipantIndex>none</ParticipantIndex></Discovery></Domain></CycloneDDS>'

or in a config file (with line breaks added for readability):

<CycloneDDS>
  <Domain>
    <General>
      <Interfaces>
        <NetworkInterface name="lo"/>
      </Interfaces>
      <AllowMulticast>true</AllowMulticast>
    </General>
    <Discovery>
      <ParticipantIndex>none</ParticipantIndex>
    </Discovery>
  </Domain>
</CycloneDDS>
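
The file can then be referenced via a file URI instead of inlining the XML (the path here is illustrative):

export CYCLONEDDS_URI=file:///path/to/cyclonedds.xml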

@eboasson is there a chance to change this? This seems like a bug in the default settings for Cyclone users.

Hi @SteveMacenski, it was changed in rmw_cyclonedds_cpp to support Iron's new discovery modes (specifically, to prevent a random multicast discovery packet from the Ethernet wiggling its way into a loopback interface). Simply changing the default in Cyclone would break Iron's new behaviour, but I'm working on a change to Cyclone that makes it possible to support Iron's behaviour while using multicast on loopback. I have the code to extract the data I need from the kernel on all three platforms; now I "only" need to use it.

The plan then is to do a backport to 0.10 so Iron (and Humble) can get easy access to it.

doisyg commented

Friendly ping ;)

@doisyg pong!

Have mercy or faith or, ideally, both ...

I just have too many things to juggle, but I haven't forgotten it and I will get there ...

doisyg commented

Have mercy or faith or, ideally, both ...
Thanks! I have both ;) Just wanted to be sure it is not forgotten

@eboasson just wanted to follow up as well. This is very important for having the same great out-of-the-box behavior Cyclone’s had up to this point! It may seem small from a tech perspective, but it’s a big deal for users to buy into Cyclone over other options in initial testing!

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/manipulation-wg-moveit-maintainer-meeting-november-30-rescheduled/34686/2

I ran into the same issue. The solution presented here (#458 (comment)) worked for me, except that I had to manually enable multicast on the loopback interface.

So, for the sake of completeness, if someone is looking for a solution:

export CYCLONEDDS_URI='<CycloneDDS><Domain><General><Interfaces><NetworkInterface name="lo"/></Interfaces><AllowMulticast>true</AllowMulticast></General><Discovery><ParticipantIndex>none</ParticipantIndex></Discovery></Domain></CycloneDDS>'

And manually enable multicast for the loopback interface:

sudo ifconfig lo multicast
sudo route add -net 224.0.0.0 netmask 240.0.0.0 dev lo
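
The equivalent with the modern iproute2 tools (same effect; assumes iproute2 is installed):

sudo ip link set lo multicast on
sudo ip route add 224.0.0.0/4 dev lo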