insufficient performance in the QoS demo using default parameters
dirk-thomas opened this issue · 26 comments
The scope of this ticket is focused on the performance of the QoS demo with default values between a publisher and subscriber using the same RMW implementation. The cross-vendor results are only mentioned for completeness / context.
To reproduce, run the easiest example from the QoS demo (the -b flag makes cam2image publish a predefined "burger" test image instead of reading from a camera):
ros2 run image_tools cam2image -b
ros2 run image_tools showimage
Which means:
- reliable (default)
- queue depth: 10 (default)
- publish frequency in Hz: 30 (default)
- history QoS setting: keep all the samples (default)
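For reference, these defaults amount to roughly the following QoS profile expressed against the rmw C API. This is a hedged sketch, not the demo's actual source; the helper name is hypothetical, and the enum and header names are taken from the rmw package and may differ slightly between releases:

```cpp
#include "rmw/qos_profiles.h"
#include "rmw/types.h"

// Sketch of the QoS profile the demo defaults amount to
// (the 30 Hz publish rate is a demo parameter, not a QoS policy).
rmw_qos_profile_t make_demo_qos_profile()
{
  rmw_qos_profile_t qos = rmw_qos_profile_default;
  qos.reliability = RMW_QOS_POLICY_RELIABILITY_RELIABLE;  // reliable
  qos.depth = 10;                                         // queue depth 10
  qos.history = RMW_QOS_POLICY_HISTORY_KEEP_ALL;          // keep all samples
  return qos;
}
```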
The following results show significant differences in performance depending on which RMW implementation is chosen on the publisher and subscriber side (collected with the default branches on Ubuntu 16.04 on a Lenovo P50). Only the diagonal highlighted with "quality" colors is of interest for now:
When increasing the image size to -x 640 -y 480 the problems become even more apparent:
The acceptance criteria to resolve this ticket are:
- without changing the demo itself or the QoS settings
- the FastRTPS publisher and subscriber should demonstrate "good" performance, comparable with the other vendors.
PS: please don't suggest variations of the QoS parameters in this thread but keep this ticket focused on this very specific case. Other cases can be considered separately.
Another data point: a new ThinkPad X1 (i7-8550U) running Ubuntu 18.04. Qualitative performance is the same as reported by @dirk-thomas:
FastRTPS-FastRTPS
- 320x240 images @ 30 Hz: the "stuttering" is noticeable. CPU load ~7% for both sender and receiver processes.
- 640x480 images @ 30 Hz: the "stuttering" is quite severe. In this case, CPU load is ~6% for the sender and ~100% for the receiver.
Connext-Connext
- 320x240 images @ 30 Hz: flawless. CPU load ~7% for both sender and receiver processes
- 640x480 images @ 30 Hz: flawless. CPU load ~50% for both sender and receiver processes
For people testing this today: if you can afford to rebuild a workspace before testing, please consider using the timestamp_console_logging branch from rcutils (and rclcpp if you build the tests). This will allow us to have timestamps reported in the console output and give more quantitative feedback.
To use it:
- compile a workspace using the `timestamp_console_logging` branch
- set the RCUTILS_CONSOLE_OUTPUT_FORMAT variable before running your executables:
export RCUTILS_CONSOLE_OUTPUT_FORMAT="[{severity}] [{name}] [{time}]: {message}"
Example of resulting output:
Publisher:
[INFO] [cam2image] [1527186403.826753588]: Publishing image #400
[INFO] [cam2image] [1527186403.860162956]: Publishing image #401
[INFO] [cam2image] [1527186403.893428871]: Publishing image #402
[INFO] [cam2image] [1527186403.926734432]: Publishing image #403
[INFO] [cam2image] [1527186403.960066041]: Publishing image #404
[INFO] [cam2image] [1527186403.993477350]: Publishing image #405
Subscriber:
[INFO] [showimage] [1527186403.829860703]: Received image #400
[INFO] [showimage] [1527186403.862650268]: Received image #401
[INFO] [showimage] [1527186403.895934081]: Received image #402
[INFO] [showimage] [1527186403.929688603]: Received image #403
[INFO] [showimage] [1527186403.963225085]: Received image #404
[INFO] [showimage] [1527186403.996462511]: Received image #405
So I can confidently say that for me Fast-RTPS is smooth and doesn't show latency at 30 Hz (~3 ms of delay after 400 frames).
Do you see non-trivial jitter? I'm not sure how to quantify that other than gathering a lot of samples and showing the distribution in a log-log plot or something. We could compute first-order statistics (median, std. dev., etc.) but those tend to not fully capture the things we worry about, like the worst case or "fraction of outliers", etc.
I tested Fast-RTPS on my Fedora machine as well (all ros2 repos on latest master). The Fast-RTPS -> Fast-RTPS case also stuttered badly, just like what @dirk-thomas mentioned in the initial comment.
I'm having some trouble rebuilding on the timestamp_console_logging branch that @mikaelarguedas mentioned, but I'll try a bit more and report back here once I figure it out.
Seeing the same stuttering with fastrtps on my 16.04 desktop.
These are some screen captures of the performance:
I finally got the timestamps compiling, and took some data on my machine. First, this is the default of 320x240; the cam2image process took ~12% of a core and showimage took ~10% of a core (so I was not CPU bound). Second, once I took the data I post-processed it with a small Python script to calculate the delta between successive publishes/receives (I can share that script later if needed).
The cam2image publishing data is here: https://gist.github.com/clalancette/def97b55da9c6c2af52f9a6195b4845d , and it shows about what we expect; we publish an image every ~33ms, for a frame rate of around 30Hz.
The showimage subscribe data is here: https://gist.github.com/clalancette/1d6259b52e6679a96c8f3eafb2ffd9de . It shows that the data is being received in a bursty manner. Looking at lines 57-82, for instance, we see that it is receiving data at ~33ms intervals, and then at line 83 there is an 'event' that causes the next frame to take ~5 times longer. That causes a backlog of 5 items which are processed quickly (lines 84-88), at which point things get back to normal.
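For anyone who wants to reproduce this kind of post-processing: the Python script mentioned above wasn't shared, so here is a minimal reconstruction of the same delta computation (written in C++ here; it assumes log lines carrying the `[<sec>.<nsec>]` stamp produced by the RCUTILS_CONSOLE_OUTPUT_FORMAT shown earlier, with a 9-digit nanosecond fraction):

```cpp
// Minimal sketch: print the delta (in seconds) between successive
// timestamped console log lines, e.g. from cam2image or showimage.
#include <fstream>
#include <iostream>
#include <regex>
#include <string>

int main(int argc, char ** argv)
{
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " <logfile>" << std::endl;
    return 1;
  }
  std::ifstream log(argv[1]);
  // Matches the "[1527186403.826753588]" style stamp; the other bracketed
  // fields ([INFO], [cam2image]) contain no "digits.digits" and won't match.
  const std::regex stamp_re(R"(\[(\d+)\.(\d+)\])");
  std::string line;
  double previous = -1.0;
  while (std::getline(log, line)) {
    std::smatch m;
    if (!std::regex_search(line, m, stamp_re)) {
      continue;
    }
    // Nanosecond fraction assumed (9 digits, as in the output above).
    const double t = std::stod(m[1].str()) + std::stod(m[2].str()) * 1e-9;
    if (previous >= 0.0) {
      std::cout << (t - previous) << std::endl;
    }
    previous = t;
  }
  return 0;
}
```

Feeding the publisher and subscriber logs through something like this yields the inter-sample deltas from which summary statistics or plots can then be computed.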
We were discussing this issue in our daily meeting. We are now working on some performance improvements, including in the rmw layer, so this is perfect timing to analyze this case. We will update you soon.
Result of the 640x480 test on the 3 rmw implementations with the following setup:
Dell XPS15, Ubuntu 16.04, send/receive 400 images, image rendering enabled
ros2 run image_tools cam2image -b -x 640 -y 480
640x480 (latency statistics in seconds):

| | Min (s) | Max (s) | Mean (s) | Median (s) | Stddev (s) | fps | sender CPU | receiver CPU |
|---|---|---|---|---|---|---|---|---|
| connext | 0.0300738811 | 0.041087389 | 0.0318081552 | 0.0316724777 | 0.0010073057 | 30.2290740409 | 50.00% | 50.00% |
| opensplice | 0.010027647 | 0.051132679 | 0.022188526 | 0.019367218 | 0.0093266182 | 30.2616755364 | 36.00% | 31.00% |
| fastrtps | 0.0020124912 | 0.1774580479 | 0.0352770768 | 0.0078625679 | 0.0464236557 | 30.6302253416 | 22.00% | 33.00% |
Serious stuttering with FastRTPS, very smooth for the other ones.
Below is a graph of the latency for each rmw implementation.
I experienced the same behavior as @mikaelarguedas on my 2017 macbook pro (macOS 10.13.3).
I noticed a few behaviors of interest to me:
- @ 320x240 I experience smoothness interrupted (seemingly randomly) with stuttering, basically what @mikaelarguedas's plot shows
- @ 640x480 I experience uninterrupted smoothness, but growing latency (sort of what I would expect from reliable + keep all), and notably less or no stuttering
- @ 640x480 when I ctrl-c the publisher, the subscriber stops about ~100 sequence numbers short, which seems to match the latency mentioned in the previous point, and it supports the idea that it is just the history building up because the pub/sub cannot keep up (it should be able to though in my opinion)
- @ 160x120 I see none of the previous behaviors, it is smooth, low latency, and if I ctrl-c the publisher the subscription receives all sequence numbers including the one last printed by the publisher (basically all the behavior I expect if the system is keeping up)
I experienced similar behaviors on my Windows desktop, not exactly the same, but the same trend (enough to convince me that it's not OS dependent at this point). I can provide timestamped data if needed on any platform.
@andreaspasternak can you weigh in here please?
In this branch I've decreased the default values for some timers and it improves the responsiveness at the RTPS level. I've noticed it using cam2image and showimage. Can anyone test with this branch?
Thanks @richiware for looking into it. The behavior on the hotfix branch is much better indeed!
Could you briefly describe the tradeoffs being made with the new timer values?
Results on my XPS15 using the new branch:
| | Min (s) | Max (s) | Mean (s) | Median (s) | Stddev (s) | fps |
|---|---|---|---|---|---|---|
| connext | 0.0300700665 | 0.0410900116 | 0.0318081648 | 0.0316700935 | 0.0010073107 | 30.2290620882 |
| opensplice | 0.0100297928 | 0.0511300564 | 0.0221883707 | 0.019359827 | 0.0093266318 | 30.261684248 |
| fastrtps | 0.002010107 | 0.1774599552 | 0.0352770948 | 0.0078701973 | 0.0464236037 | 30.6302342667 |
| fastrtps_hotfix | 0.001853466 | 0.0287704468 | 0.0074550388 | 0.0068049431 | 0.0458932319 | 30.2791350882 |
With the hotfix and for this test, Fast-RTPS has less latency than the other rmw implementations using default values:
Note: I haven't tested other frequencies or message sizes
The branch definitely shows better performance with the burger demo. Thank you for the quick update!
Could you briefly describe the tradeoffs being made with the new timer values?
Yeah, any information what you expect as "side effects" for other use cases would be good to hear.
Hi all,
Regarding this:
Second, once I took the data I post-processed it with a small Python script to calculate the delta between successive publishes/receives (I can share that script later if needed).

Do you see non-trivial jitter? I'm not sure how to quantify that other than gathering a lot of samples and showing the distribution in a log-log plot or something.
Next week we will release as open source a performance measurement tool for various communication systems. Its core features are:
- Logging of performance metrics such as latency, throughput, lost samples and system utilization to a file.
- Creating graphical diagrams from the log files using an extensible python script.
- Easy extensibility to any desired publish/subscribe communication framework.
- Extensive command line interface which allows configuration of the number of publishers and subscribers, quality of service and real-time settings, message type, publishing frequency and more.
- Support for analyzing inter-process, intra-process, and inter-machine communication.
We have been working together with eProsima (FastRTPS) to improve performance under real-time conditions for autonomous driving, with great success.
This will also improve the performance of the upcoming new RMW implementation for FastRTPS with static type support (https://github.com/eProsima/rmw_fastrtps/tree/static_typesupport).
In the following graphs, note that the y-axis is latency in ms and the x-axis is the duration of the experiment in seconds.
Using FastRTPS directly on a real-time system with real-time settings shows good results for medium-sized data at 1000 Hz:
Large data also works well at 30 Hz after applying the settings in http://docs.eprosima.com/en/latest/advanced.html#finding-out-system-maximum-values:
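(For context, the relevant part of that page is raising the kernel's maximum socket buffer sizes; on Linux that is done with sysctl, along these lines, where the value is purely illustrative and the linked page should be consulted for the recommended ones:)

sudo sysctl -w net.core.rmem_max=4194304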
We found for example that busy waiting (meaning not using a waitset) leads to a very significant performance degradation due to expensive mutex locking/unlocking.
Using the current dynamic type support of FastRTPS together with ROS 2 does not provide optimal performance, but this will very likely be resolved shortly by the upcoming new RMW implementation for FastRTPS with static type support.
If you would like to test other RMW implementations, you can do that using the RMW_IMPLEMENTATION environment variable.
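For example, to run the same demo against a specific implementation:

RMW_IMPLEMENTATION=rmw_connext_cpp ros2 run image_tools cam2image -b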
We took care to make the performance test tool real-time capable by:
- Testing it extensively under RT-Linux with proper CPU isolation and memory locking.
- Preallocating all memory.
- Carefully using dedicated RT threads which do not use any blocking calls.
Here is the latency at 100 Hz with real-time functionality enabled:
@richiprosima Can you please respond to the above questions? We would like to get this resolved (merged into master) sooner rather than later. Thanks.
@mikaelarguedas @dirk-thomas I've decreased two QoS values, one for publishers and one for subscribers.
- For publishers: I've decreased publisher_attr.times.nackResponseDelay. It specifies the time a publisher waits before responding to a subscriber's ACKNACK. Its previous value was 45ms, and in the patch it is 5ms. Its purpose is to reduce the number of retries sending the same requested samples to multiple subscribers.
- For subscribers: I've decreased subscriber_attr.times.heartbeatResponseDelay. It specifies the time a subscriber waits before responding to a publisher's heartbeat. Its previous value was 116ms, and in the patch it is 5ms. Its purpose is to reduce the burst of ACKNACKs.
Decreasing these values improves the response to lost data: a lost sample takes less time to be re-sent. This is most noticeable in the video scenario because you can see the stuttering. Why is data lost? A video frame is too large for the default Linux socket buffer size, so some RTPS fragments are lost.
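For illustration, this is roughly where those two attributes live in the Fast-RTPS C++ API. This is a sketch only, not the actual patch (which changes the library defaults rather than per-application attributes); the helper name is hypothetical, and since Duration_t in the Fast-RTPS versions of this era carries seconds plus a fraction in 1/2^32 s units, the exact field values below are assumptions:

```cpp
#include <fastrtps/attributes/PublisherAttributes.h>
#include <fastrtps/attributes/SubscriberAttributes.h>

// Hypothetical helper showing where the two tuned delays are set.
void apply_hotfix_like_delays(
  eprosima::fastrtps::PublisherAttributes & publisher_attr,
  eprosima::fastrtps::SubscriberAttributes & subscriber_attr)
{
  // Delay before a publisher answers a subscriber's ACKNACK (default was ~45 ms).
  publisher_attr.times.nackResponseDelay.seconds = 0;
  publisher_attr.times.nackResponseDelay.fraction = 21474836;  // ~5 ms in 1/2^32 s units (assumed)

  // Delay before a subscriber answers a publisher's heartbeat (default was ~116 ms).
  subscriber_attr.times.heartbeatResponseDelay.seconds = 0;
  subscriber_attr.times.heartbeatResponseDelay.fraction = 21474836;  // ~5 ms (assumed)
}
```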
This week I want to play more with these values and it will be merged at the end of the week.
Could you briefly describe the tradeoffs being made with the new timer values?
I've updated my comment above.
Thanks @richiware for the updated comment!
(sorry if it's obvious) Can you give examples of scenarios where a shorter nackResponseDelay (or heartbeatResponseDelay) would result in worse performance?
As a made-up example of the kind of "scenarios" I'm thinking of: is it possible that in network environments with significant latency, fragments will be considered lost while they may still be on the way, and with a shorter nackResponseDelay we would nack them too early, resulting in an increasing number of resends and collisions?
nackResponseDelay is also used to reduce the number of heartbeat messages. If several ACKNACK messages from different subscribers are received in a short time, they will be answered by one heartbeat instead of one for each ACKNACK.
It's concerning to me that these parameters are fixed times. It would probably help if they were related to network latency somehow so you don't have to tune them manually for the type of network you are using.
This week I want to play more with these values and it will be merged at the end of the week.
@richiware As far as I can see the patch hasn't been merged yet. Can you clarify the timeline when that is planned to happen?
@dirk-thomas, @MiguelCompany and I have been working on a new release. We are integrating all new features, and also this patch, into our internal develop branch. I want to release it to master tomorrow, Wednesday. Sorry for the inconvenience.
With the hotfix merged upstream I will go ahead and close this.
@dirk-thomas for traceability it would be nice to link to the upstream commit where the fix happened.
Commit that changed the timer values: eProsima/Fast-DDS@201a393