Lost messages when a port has multiple connections (one remote connection)
Opened this issue · 6 comments
Hi,
I'm writing this to have visible documentation of the issue. The initial explanation, including code to reproduce the issue, was in a closed PR: #122, which was not a proper solution and raised other issues.
As output ports are not buffered, all new data is dropped until the previous data has been completely written to the connected input ports.
When an output port has two connections, new writes are only sent once the previous data has been sent and the acks have been received via Corba. This is a problem for remote connections, especially on networks with low bandwidth.
If the low-bandwidth connection takes longer for the Corba ack than the period of the task, the next message will not be delivered at all. This means not even the local connection receives the message.
So if you connect a remote GUI via the network, local messages may be lost.
Nevertheless, a simple workaround is to avoid additional remote connections on output ports. This can be achieved by adding a repeater task, which simply writes the data received on an input port to an output port of the same type. This way only the repeater task loses messages, not the important task with the multiple connections (original connection + repeater).
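To illustrate the repeater idea, here is a minimal self-contained sketch (hypothetical names, not the RTT API): the important task writes into the repeater's input, which always succeeds locally; only the repeater forwards the latest value to the slow remote sink, so any drops caused by the slow link are confined to the repeater's connection.

```cpp
#include <functional>

// Hypothetical sketch of the repeater workaround (made-up names, not RTT code).
// A Sink stands in for an output connection; write() returns false on a drop.
struct Sink { std::function<bool(int)> write; };

struct RepeaterTask {
    int latest = 0;
    bool fresh = false;
    Sink remote;  // the slow remote connection now hangs off the repeater only

    // Local input: always succeeds, never blocks the producing task.
    bool input(int sample) {
        latest = sample;
        fresh = true;
        return true;
    }

    // Periodic hook: forward the latest value to the remote sink once.
    void updateHook() {
        if (fresh) {
            remote.write(latest);
            fresh = false;
        }
    }
};
```

Only the latest value since the last period is forwarded, which mirrors the lossy-but-local-safe behavior described above.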
As output ports are not buffered, all new data is dropped until the previous data has been completely written to the connected input ports.
That's only partially true. For remote connections Orocos adds a local buffer per remote connection for output ports, and the write calls never block on Corba. For buffer connections this additional local buffer can also hold multiple elements, and unless this buffer is full, no samples will be lost. A Corba dispatcher thread (one per component) then empties the local buffer. Unless one-way writes are enabled (a compile-time option since #123), the dispatcher writes one sample at a time and indeed waits for the remote end to acknowledge, which, for low-bandwidth or high-latency connections, could result in dropped samples in the output buffer. But local connections should not be affected by that.
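The buffering and dispatcher behavior described above can be modeled with a small self-contained sketch (hypothetical names and simplified logic, not the actual RTT code): write() only fills a bounded per-connection buffer and never blocks, while a dispatcher drains one sample per ack round. When the dispatcher falls behind, excess writes are dropped from that connection's buffer only.

```cpp
#include <cstddef>
#include <deque>

// Simplified model of one remote connection of an output port (not RTT code).
struct RemoteConnectionModel {
    std::size_t capacity;      // size of the local per-connection buffer
    std::deque<int> buffer;    // samples waiting for the dispatcher
    int dropped = 0;           // samples lost because the buffer was full

    explicit RemoteConnectionModel(std::size_t cap) : capacity(cap) {}

    // write() never blocks on Corba: it only fills the local buffer,
    // dropping the sample if the buffer is already full.
    void write(int sample) {
        if (buffer.size() < capacity)
            buffer.push_back(sample);
        else
            ++dropped;
    }

    // The dispatcher forwards one sample per invocation, waiting for the
    // remote ack in between (unless one-way writes are enabled).
    bool dispatchOne() {
        if (buffer.empty()) return false;
        buffer.pop_front();
        return true;
    }
};
```

If the ack latency exceeds the task period, dispatchOne() runs less often than write(), and the drop counter grows on that connection alone.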
Did you actually observe something different?
Hi,
I experienced the problem on master, I guess because #123 is not merged yet.
#123 added the compile-time option to switch to one-way writes, but the Corba dispatcher thread has existed much longer, probably since the beginning of the RTT Corba transport plugin. The responsible code is in RemotePorts.cpp:177 and following on master. The dispatcher is triggered in RemoteChannelElement::signal() in RemoteChannelElement.hpp:132 and then calls RemoteChannelElement::transferSamples() to actually empty the buffer and forward the samples to Corba in a non-realtime context.
So any remote connection of an output port should not be able to delay the samples written to another local connection of the same port (or at least not more than an additional local connection would), and it should definitely not call directly into the Corba middleware. If that were the case, it would be a bug.
I was not able to run your original example from #122 because I am not familiar with orogen, and simply running orogen -v oneway_test.orogen
results in this error on my Ubuntu Xenial system using the toolchain-2.9 branches of all toolchain packages, which contain all patches from the respective master branches. Can you either help with that problem or provide CMake code and a deployment script to run the example without orogen?
So any remote connection of an output port should not be able to delay the samples written to another local connection of the same port
Ok, perhaps some information is missing: the "local connection" is also a Corba connection, just locally on the same PC. So both connections use Corba.
https://github.com/orocos-toolchain/rtt/blob/master/rtt/transports/corba/RemoteChannelElement.hpp#L342 this Corba write blocks until the transmission is complete, so transferSamples() and signal() are blocking as well.
If a new signal() arrives while the old one is still running (blocked), I guess it is not executed. I also guess that the port's data content is overwritten in this case, or am I wrong?
The multi-dispatcher setup is IMO the best solution to segregate domains (UI vs. system, reliable vs. unreliable).
On single-host machines, you could also work around this by using the MQ transport, which does not have the same issue.