CycloneDDS hangs when the network interface switches to loopback

Question

amalnanavati opened this issue 9 months ago · 6 comments

amalnanavati commented 9 months ago

Bug report

Required Info:

Operating System:
- Ubuntu 22.04.4 (jammy) LTS
Installation type:
- binary
Version or commit hash:
- 1.3.4-1jammy.20240217.062033
DDS implementation:
- CycloneDDS
Client library (if applicable):
- ros2cli

Steps to reproduce issue

1. Start with the Wired and Wireless internet turned off.
2. Start with the ros2 daemon not running (i.e., run ros2 daemon stop).
3. Set CycloneDDS as your RMW: export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
4. Run ros2 topic list (or any other ros2cli command)
5. Turn the Wired network connection on.
6. Run ros2 topic list (or any other ros2cli command)
7. Turn the Wired network connection off.
8. Run ros2 topic list (or any other ros2cli command)

Expected behavior

All of the ros2cli commands succeed.

Actual behavior

The final ros2cli command hangs (I waited 30 secs before terminating it).
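To check for the hang without waiting by hand, a small wrapper can run a command with a deadline. This is only a sketch: the helper name completes_within is made up for illustration, and it assumes a sourced ROS 2 environment with ros2 on the PATH when used as shown in the note below.

```python
import subprocess


def completes_within(cmd, timeout_s):
    """Return True if cmd exits within timeout_s seconds, False if it timed out."""
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising on timeout.
        return False
```

For the last step of the reproduction, completes_within(["ros2", "topic", "list"], 30) would be expected to return False when the hang occurs and True otherwise.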

Additional information

I tried the exact steps with FastRTPS (export RMW_IMPLEMENTATION=rmw_fastrtps_cpp) and it works as expected, which points to a bug in CycloneDDS.
As evidenced by Step 4, if the ROS2 daemon starts in loopback, it works fine. The issue only occurs when a running ROS2 daemon has connected to a Wired network, and then switches back to loopback.
I followed the instructions here to enable multicasting on loopback. I verified that multicast is working on loopback with ros2 multicast receive and ros2 multicast send. For the record, I've pasted the output of ifconfig and route below:

$ ifconfig
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 2c:f0:5d:0b:a0:2b  txqueuelen 1000  (Ethernet)
        RX packets 185454  bytes 105045108 (105.0 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 132692  bytes 36348243 (36.3 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 16  memory 0xa3200000-a3220000  

lo: flags=4169<UP,LOOPBACK,RUNNING,MULTICAST>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 7830997  bytes 14975734594 (14.9 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 7830997  bytes 14975734594 (14.9 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

$ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
224.0.0.0       0.0.0.0         240.0.0.0       U     0      0        0 lo

Answer 1 · 2024-04-25T22:41:41.000Z

@eboasson, could you please take a look at this issue and do a preliminary root cause analysis? Or is this perhaps a known issue in CycloneDDS?
It reproduces only with rmw_cyclonedds.

@amalnanavati When you mentioned

Turn the Wired network connection on.

and

Turn the Wired network connection off.

How did you do that? What commands did you use for that?

Answer 2 · 2024-04-25T23:01:09.000Z

I used the "Settings" application in the Ubuntu Desktop GUI. If I recall correctly, in Settings > Wired there is a button to toggle the wired network interface on and off.

Answer 3 · 2024-04-26T05:53:35.000Z

I am not in a situation where I can try it out to check my hypothesis, but I think there's a pretty good chance it is the following.

The default configuration (in 0.10.x, like ROS uses) uses several threads to process the incoming data, and there are two sockets (application data via unicast and same via multicast) that are handled by dedicated threads. Quite simply, they do a simple blocking read on the socket, only waking up when a packet arrives.

When stopping, you have to make sure these threads wake up, but there is no guarantee that they will unless a packet is sent to the socket, so Cyclone sends one to itself. With the network interface turned off, I suspect this bit no longer works.
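The pattern described here can be sketched in a few lines of Python. This is not Cyclone's code, only an illustration of a thread doing a blocking read that is woken on shutdown by a packet sent to its own socket. The class name BlockingReceiver is made up, and the sketch uses the loopback address so the wake-up packet always arrives. Under the hypothesis above, the failure would correspond to the wake-up packet never being delivered, in which case stop() here would time out and return False.

```python
import socket
import threading


class BlockingReceiver:
    """One thread doing a plain blocking read on one UDP socket."""

    def __init__(self):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.bind(("127.0.0.1", 0))  # loopback, OS-chosen port
        self.stopping = threading.Event()
        self.received = []
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while not self.stopping.is_set():
            # Blocks until a packet arrives; nothing else can interrupt it.
            data, _addr = self.sock.recvfrom(65535)
            if not self.stopping.is_set():
                self.received.append(data)

    def stop(self, timeout_s=2.0):
        """Ask the thread to exit; return True if it actually did."""
        self.stopping.set()
        # Send a packet to our own socket so the blocked recvfrom returns.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as waker:
            waker.sendto(b"wake", self.sock.getsockname())
        self.thread.join(timeout_s)
        stopped = not self.thread.is_alive()
        if stopped:
            self.sock.close()
        return stopped
```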

For this hypothesis, there's a workaround: set Internal/MultipleReceiveThreads to false in the Cyclone config XML or CYCLONEDDS_URI environment variable (e.g., do export CYCLONEDDS_URI="<Internal><MultipleReceiveThreads>false</></>"). That way it no longer does any blocking reads, but instead multiplexes over all sockets and has an alternative way of interrupting this wait. If that solves it, then this is a variant of the well-known case of firewalls causing Cyclone to hang on shutdown. (I changed the default in master because of that annoyance.)
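As a rough illustration of the difference, the sketch below waits on several sockets at once and uses a separate wake-up channel for shutdown, so stopping does not depend on a network packet being delivered. The comment above does not say which mechanism Cyclone uses to interrupt the wait; the socket pair here is an assumed stand-in, and the class name MultiplexedReceiver is made up.

```python
import select
import socket
import threading


class MultiplexedReceiver:
    """One thread waiting on all sockets plus a local wake-up channel."""

    def __init__(self, n_sockets=2):
        self.socks = []
        for _ in range(n_sockets):
            s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            s.bind(("127.0.0.1", 0))  # loopback, OS-chosen port
            self.socks.append(s)
        # Local socket pair used only to interrupt the wait (an assumption).
        self._wake_r, self._wake_w = socket.socketpair()
        self.received = []
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            readable, _, _ = select.select(self.socks + [self._wake_r], [], [])
            if self._wake_r in readable:
                return  # shutdown requested
            for s in readable:
                data, _addr = s.recvfrom(65535)
                self.received.append(data)

    def stop(self, timeout_s=2.0):
        """Interrupt the wait; return True if the thread exited."""
        self._wake_w.send(b"x")
        self.thread.join(timeout_s)
        stopped = not self.thread.is_alive()
        if stopped:
            for s in self.socks + [self._wake_r, self._wake_w]:
                s.close()
        return stopped
```

Because the wake-up goes through a local socket pair rather than through the network stack of a particular interface, stop() in this sketch does not rely on any interface being up.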

If the above isn't correct, then I suspect it is caused by the more general problem of Cyclone not being good at dealing with changes to the network configuration for the network interfaces it uses. In principle it is straightforward (open/close some sockets when interfaces become available/disappear, update addresses advertised in discovery, update the addresses used for sending), but it is just not doing that yet.

Answer 4 · 2024-05-10T00:16:43.000Z

@amalnanavati does @eboasson's reply resolve this issue? 🧇

Answer 5 · 2024-05-14T01:27:55.000Z

@sloretz unfortunately I won't have access to the machines I detected this bug on for the coming months.

However, given that this only occurs in a rare scenario (ros2 daemon starts with ethernet, then ethernet disconnects), it is not a blocking bug for us. I mainly wanted to report it upstream to document it, especially in case others experience it.

Answer 6 · 2024-09-02T00:45:55.000Z

@eboasson I faced a similar issue and your suggestion worked. Are there any other consequences of setting Internal/MultipleReceiveThreads to false? Will it restrict the rate or amount of data that the system can receive?