CycloneDDS hangs when the network interface switches to loopback
amalnanavati opened this issue · 6 comments
Bug report
Required Info:
- Operating System:
- Ubuntu 22.04.4 (jammy) LTS
- Installation type:
- binary
- Version or commit hash:
- 1.3.4-1jammy.20240217.062033
- DDS implementation:
- CycloneDDS
- Client library (if applicable):
- ros2cli
Steps to reproduce issue
1. Start with the Wired and Wireless internet turned off.
2. Start with the ros2 daemon not running (i.e., run ros2 daemon stop).
3. Set CycloneDDS as your RMW: export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
4. Run ros2 topic list (or any other ros2cli command)
5. Turn the Wired network connection on.
6. Run ros2 topic list (or any other ros2cli command)
7. Turn the Wired network connection off.
8. Run ros2 topic list (or any other ros2cli command)
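For anyone re-running these steps, the hang in the final step can be detected automatically rather than by waiting at the terminal. The following is only a minimal sketch: `run_with_timeout` is a hypothetical helper, not part of ros2cli, and a sleeping Python child process stands in for the hanging command so the snippet runs without ROS installed.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd and return its exit code, or None if it was killed after `seconds`."""
    try:
        completed = subprocess.run(
            cmd,
            timeout=seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return completed.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising, so nothing is left behind.
        return None


# Stand-ins so the sketch is runnable anywhere. On the affected machine the
# command would be ["ros2", "topic", "list"] with a 30 second timeout.
hanging_cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
quick_cmd = [sys.executable, "-c", "pass"]

print("hanging command:", run_with_timeout(hanging_cmd, 1))
print("quick command:", run_with_timeout(quick_cmd, 30))
```

A return value of None corresponds to the "Actual behavior" described below, and an exit code of 0 to the expected one.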
Expected behavior
All of the ros2cli commands succeed.
Actual behavior
The final ros2cli command hangs (I waited 30 secs before terminating it).
Additional information
- I tried the exact steps with FastRTPS (export RMW_IMPLEMENTATION=rmw_fastrtps_cpp) and it works as expected, which points to a bug in CycloneDDS.
- As evidenced by Step 4, if the ROS2 daemon starts in loopback, it works fine. The issue only occurs when a running ROS2 daemon has connected to a Wired network, and then switches back to loopback.
- I followed the instructions here to enable multicasting on loopback. I verified that multicast is working on loopback with ros2 multicast receive and ros2 multicast send. For the record, I've pasted the output of ifconfig and route below:
$ ifconfig
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether 2c:f0:5d:0b:a0:2b txqueuelen 1000 (Ethernet)
RX packets 185454 bytes 105045108 (105.0 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 132692 bytes 36348243 (36.3 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 16 memory 0xa3200000-a3220000
lo: flags=4169<UP,LOOPBACK,RUNNING,MULTICAST> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 7830997 bytes 14975734594 (14.9 GB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 7830997 bytes 14975734594 (14.9 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
$ route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
224.0.0.0 0.0.0.0 240.0.0.0 U 0 0 0 lo
@eboasson, could you please take a look at this issue and do a preliminary root cause analysis? Or is this perhaps a known issue in CycloneDDS?
It reproduces only with rmw_cyclonedds.
@amalnanavati When you mentioned
- Turn the Wired network connection on.
and
- Turn the Wired network connection off.
How did you do that? What commands did you use for that?
I used the "Settings" application in the Ubuntu Desktop GUI. If I recall correctly, in Settings > Wired there is a button to toggle the wired network interface on and off.
I am not in a situation where I can try it out to check my hypothesis, but I think there's a pretty good chance it is the following.
The default configuration (in 0.10.x, like ROS uses) uses several threads to process the incoming data, and there are two sockets (application data via unicast and the same via multicast) that are handled by dedicated threads. Quite simply, these threads do a blocking read on their socket, only waking up when a packet arrives.
When stopping, Cyclone has to make sure these threads wake up, but there is no guarantee that they will unless a packet is sent to the socket, so it sends one to itself. With the network interface turned off, I suspect this self-send no longer works.
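To make the hypothesis concrete, here is a minimal Python sketch of the pattern described above: a dedicated thread blocked in a receive call, woken on shutdown by a packet the process sends to its own socket. It is an illustration only, not Cyclone's actual code; the names `blocking_receiver` and `stop_receiver` are made up for this example.

```python
import socket
import threading


def blocking_receiver(sock, stop_flag, log):
    """Block in recvfrom() until a packet arrives, then check the stop flag."""
    while True:
        data, _addr = sock.recvfrom(65536)  # blocks indefinitely, no timeout
        if stop_flag.is_set():
            log.append("woken by packet, exiting")
            return
        log.append(data)


def stop_receiver(sock, stop_flag):
    """Request shutdown by sending a packet to the receiver's own address."""
    stop_flag.set()
    waker = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # If this packet never arrives (interface gone, firewall dropping it),
        # the receiver thread stays blocked in recvfrom() and shutdown hangs.
        waker.sendto(b"wake", sock.getsockname())
    finally:
        waker.close()


sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 0))  # loopback, so the wake-up packet is deliverable here
stop_flag = threading.Event()
log = []
thread = threading.Thread(
    target=blocking_receiver, args=(sock, stop_flag, log), daemon=True
)
thread.start()

stop_receiver(sock, stop_flag)
thread.join(timeout=5.0)
sock.close()
print("receiver stopped:", not thread.is_alive())
```

In this sketch the socket is bound to loopback, so the wake-up always arrives. The suspected failure corresponds to the case where the socket is bound to an address on an interface that has since gone down, so the `sendto` either fails or the packet is never delivered.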
For this hypothesis, there's a workaround: set Internal/MultipleReceiveThreads to false in the Cyclone config XML or CYCLONEDDS_URI environment variable (e.g., do export CYCLONEDDS_URI="<Internal><MultipleReceiveThreads>false</></>"). That way it no longer does any blocking reads, but instead multiplexes over all sockets and has an alternative way of interrupting this wait. If that solves it, then this is a variant of the well-known case of firewalls causing Cyclone to hang on shutdown. (I changed the default in master because of that annoyance.)
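The difference with multiplexing can be sketched in Python as follows. This is only an illustration of the general technique (one thread waiting on several sockets plus a local wake-up channel), not a description of how Cyclone implements it; `multiplexing_receiver` and the use of a socket pair as the wake-up channel are assumptions made for this example.

```python
import select
import socket
import threading


def multiplexing_receiver(data_socks, wake_r, log):
    """Wait on all data sockets plus a local wake-up socket in one select()."""
    while True:
        readable, _, _ = select.select(list(data_socks) + [wake_r], [], [])
        if wake_r in readable:
            log.append("woken locally, exiting")
            return
        for s in readable:
            data, _addr = s.recvfrom(65536)
            log.append(data)


# Two UDP sockets stand in for the unicast and multicast data sockets.
sock_a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock_a.bind(("127.0.0.1", 0))
sock_b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock_b.bind(("127.0.0.1", 0))

# The wake-up channel is a connected local socket pair: it does not depend on
# any network interface, routing table entry, or firewall rule.
wake_r, wake_w = socket.socketpair()

log = []
thread = threading.Thread(
    target=multiplexing_receiver, args=([sock_a, sock_b], wake_r, log), daemon=True
)
thread.start()

wake_w.send(b"x")  # interrupt the wait without sending anything over the network
thread.join(timeout=5.0)
for s in (sock_a, sock_b, wake_r, wake_w):
    s.close()
print("receiver stopped:", not thread.is_alive())
```

Because the wake-up never touches the data sockets, shutdown in this sketch does not depend on whether a packet can still be delivered to them.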
If the above isn't correct, then I suspect it is caused by the more general problem of Cyclone not being good at dealing with changes to the network configuration for the network interfaces it uses. In principle it is straightforward (open/close some sockets when interfaces become available/disappear, update addresses advertised in discovery, update the addresses used for sending), but it is just not doing that yet.
@amalnanavati does @eboasson's reply resolve this issue? 🧇
@sloretz unfortunately I won't have access to the machines I detected this bug on for the coming months.
However, given that this only occurs in a rare scenario (ros2 daemon starts with ethernet, then ethernet disconnects), it is not a blocking bug for us. I mainly wanted to report it upstream to document it, especially in case others experience it.