insufficient performance in the QoS demo using default parameters
dirk-thomas opened this issue · 26 comments
The scope of this ticket is focused on the performance of the QoS demo with default values between a publisher and subscriber using the same RMW implementation. The cross-vendor results are only mentioned for completeness / context.
To reproduce, run the easiest example from the QoS demo (the -b flag makes cam2image publish a predefined "burger" test image instead of reading from a camera):
ros2 run image_tools cam2image -b
ros2 run image_tools showimage
Which means:
- reliable (default)
- queue depth: 10 (default)
- publish frequency in Hz: 30 (default)
- history QoS setting: keep all the samples (default)
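For reference, these defaults amount to roughly the following QoS profile expressed against the rmw C API. This is a hedged sketch, not the demo's actual source; the helper name is hypothetical, and the enum and header names are taken from the rmw package and may differ slightly between releases:

```cpp
#include "rmw/qos_profiles.h"
#include "rmw/types.h"

// Sketch of the QoS profile the demo defaults amount to
// (the 30 Hz publish rate is a demo parameter, not a QoS policy).
rmw_qos_profile_t make_demo_qos_profile()
{
  rmw_qos_profile_t qos = rmw_qos_profile_default;
  qos.reliability = RMW_QOS_POLICY_RELIABILITY_RELIABLE;  // reliable
  qos.depth = 10;                                         // queue depth 10
  qos.history = RMW_QOS_POLICY_HISTORY_KEEP_ALL;          // keep all samples
  return qos;
}
```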
The following results show significant differences in performance depending on which RMW implementation is chosen on the publisher and subscriber side (collected with the default branches on Ubuntu 16.04 on a Lenovo P50). Only the diagonal highlighted with "quality" colors is of interest for now:
When increasing the image size to -x 640 -y 480 the problems become even more apparent:
The acceptance criteria to resolve this ticket are:
- without changing the demo itself or the QoS settings
- the FastRTPS publisher and subscriber should demonstrate "good" performance, comparable with the other vendors.
PS: please don't suggest variations of the QoS parameters in this thread but keep this ticket focused on this very specific case. Other cases can be considered separately.
Another data point: a new ThinkPad X1 (i7-8550U) running Ubuntu 18.04. Qualitative performance is the same as reported by @dirk-thomas:
FastRTPS-FastRTPS
- 320x240 images @ 30 Hz: the "stuttering" is noticeable. CPU load ~7% for both sender and receiver processes.
- 640x480 images @ 30 Hz: the "stuttering" is quite severe. In this case, CPU load is ~6% for the sender and ~100% for the receiver.
Connext-Connext
- 320x240 images @ 30 Hz: flawless. CPU load ~7% for both sender and receiver processes
- 640x480 images @ 30 Hz: flawless. CPU load ~50% for both sender and receiver processes
For people testing this today: if you can afford to rebuild a workspace before testing, please consider using the timestamp_console_logging branch from rcutils (and rclcpp if you build the tests). This will allow us to have timestamps reported in the console output and give more quantitative feedback.
To use it:
- compile a workspace using the `timestamp_console_logging` branch
- set the RCUTILS_CONSOLE_OUTPUT_FORMAT variable before running your executables:
export RCUTILS_CONSOLE_OUTPUT_FORMAT="[{severity}] [{name}] [{time}]: {message}"
Example of resulting output:
Publisher:
[INFO] [cam2image] [1527186403.826753588]: Publishing image #400
[INFO] [cam2image] [1527186403.860162956]: Publishing image #401
[INFO] [cam2image] [1527186403.893428871]: Publishing image #402
[INFO] [cam2image] [1527186403.926734432]: Publishing image #403
[INFO] [cam2image] [1527186403.960066041]: Publishing image #404
[INFO] [cam2image] [1527186403.993477350]: Publishing image #405
Subscriber:
[INFO] [showimage] [1527186403.829860703]: Received image #400
[INFO] [showimage] [1527186403.862650268]: Received image #401
[INFO] [showimage] [1527186403.895934081]: Received image #402
[INFO] [showimage] [1527186403.929688603]: Received image #403
[INFO] [showimage] [1527186403.963225085]: Received image #404
[INFO] [showimage] [1527186403.996462511]: Received image #405
So I can confidently say that for me Fast-RTPS is smooth and doesn't show latency at 30 Hz (~3 ms of delay after 400 frames).
Do you see non-trivial jitter? I'm not sure how to quantify that other than gathering a lot of samples and showing the distribution in a log-log plot or something. We could compute first-order statistics (median, std. dev., etc.) but those tend to not fully capture the things we worry about, like the worst case or "fraction of outliers", etc.
I tested Fast-RTPS on my Fedora machine as well (all ros2 repos on latest master). The Fast-RTPS -> Fast-RTPS case also stuttered badly, just like what @dirk-thomas mentioned in the initial comment.
I'm having some trouble rebuilding on the timestamp_console_logging branch that @mikaelarguedas mentioned, but I'll try a bit more and report back here once I figure it out.
Seeing the same stuttering with fastrtps on my 16.04 desktop.
These are some screen captures of the performance:
I finally got the timestamps compiling, and took some data on my machine. First, this is the default of 320x240; the cam2image process took ~12% of a core and showimage took ~10% of a core (so I was not CPU bound). Second, once I took the data I post-processed it with a small Python script to calculate the delta between successive publishes/receives (I can share that script later if needed).
The cam2image publishing data is here: https://gist.github.com/clalancette/def97b55da9c6c2af52f9a6195b4845d , and it shows about what we expect; we publish an image every ~33ms, for a frame rate of around 30Hz.
The showimage subscribe data is here: https://gist.github.com/clalancette/1d6259b52e6679a96c8f3eafb2ffd9de . It shows that the data is being received in a bursty manner. Looking at lines 57-82, for instance, we see that it is receiving data at ~33ms intervals, and then at line 83 there is an 'event' that causes the next frame to take ~5 times longer. That causes a backlog of 5 items which are processed quickly (lines 84-88), at which point things get back to normal.
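For anyone who wants to reproduce this kind of post-processing: the Python script mentioned above wasn't shared, so here is a minimal reconstruction of the same delta computation (written in C++ here; it assumes log lines carrying the `[<sec>.<nsec>]` stamp produced by the RCUTILS_CONSOLE_OUTPUT_FORMAT shown earlier, with a 9-digit nanosecond fraction):

```cpp
// Minimal sketch: print the delta (in seconds) between successive
// timestamped console log lines, e.g. from cam2image or showimage.
#include <fstream>
#include <iostream>
#include <regex>
#include <string>

int main(int argc, char ** argv)
{
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " <logfile>" << std::endl;
    return 1;
  }
  std::ifstream log(argv[1]);
  // Matches the "[1527186403.826753588]" style stamp; the other bracketed
  // fields ([INFO], [cam2image]) contain no "digits.digits" and won't match.
  const std::regex stamp_re(R"(\[(\d+)\.(\d+)\])");
  std::string line;
  double previous = -1.0;
  while (std::getline(log, line)) {
    std::smatch m;
    if (!std::regex_search(line, m, stamp_re)) {
      continue;
    }
    // Nanosecond fraction assumed (9 digits, as in the output above).
    const double t = std::stod(m[1].str()) + std::stod(m[2].str()) * 1e-9;
    if (previous >= 0.0) {
      std::cout << (t - previous) << std::endl;
    }
    previous = t;
  }
  return 0;
}
```

Feeding the publisher and subscriber logs through something like this yields the inter-sample deltas from which summary statistics or plots can then be computed.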
We were discussing this issue in our daily meeting. We are now working on some performance improvements, including in the rmw layer, so this is perfect timing to analyze this case. We will update you soon.
Result of the 640x480 test on the 3 rmw implementations with the following setup:
Dell XPS15, Ubuntu 16.04, send/receive 400 images, image rendering enabled
ros2 run image_tools cam2image -b -x 640 -y 480
640x480 (latency statistics in seconds):

| | Min (s) | Max (s) | Mean (s) | Median (s) | Stddev (s) | fps | sender CPU | receiver CPU |
|---|---|---|---|---|---|---|---|---|
| connext | 0.0300738811 | 0.041087389 | 0.0318081552 | 0.0316724777 | 0.0010073057 | 30.2290740409 | 50.00% | 50.00% |
| opensplice | 0.010027647 | 0.051132679 | 0.022188526 | 0.019367218 | 0.0093266182 | 30.2616755364 | 36.00% | 31.00% |
| fastrtps | 0.0020124912 | 0.1774580479 | 0.0352770768 | 0.0078625679 | 0.0464236557 | 30.6302253416 | 22.00% | 33.00% |
Serious stuttering with FastRTPS, very smooth for the other ones.
Below is a graph of the latency for each rmw implementation.
I experienced the same behavior as @mikaelarguedas on my 2017 macbook pro (macOS 10.13.3).
I noticed a few behaviors of interest to me:
- @ 320x240 I experience smoothness interrupted (seemingly randomly) with stuttering, basically what @mikaelarguedas's plot shows
- @ 640x480 I experience uninterrupted smoothness, but growing latency (sort of what I would expect from reliable + keep all), and notably less or no stuttering
- @ 640x480 when I ctrl-c the publisher, the subscriber stops about ~100 sequence numbers short, which seems to match the latency mentioned in the previous point, and it supports the idea that it is just the history building up because the pub/sub cannot keep up (it should be able to though in my opinion)
- @ 160x120 I see none of the previous behaviors, it is smooth, low latency, and if I ctrl-c the publisher the subscription receives all sequence numbers including the one last printed by the publisher (basically all the behavior I expect if the system is keeping up)
I experienced similar behaviors on my Windows desktop, not exactly the same, but the same trend (enough to convince me that it's not OS dependent at this point). I can provide timestamped data if needed on any platform.
@andreaspasternak can you weigh in here please?
In this branch I've decreased the default values for some timers and it improves the responsiveness at the RTPS level. I've noticed it using cam2image and showimage. Can anyone test with this branch?
Thanks @richiware for looking into it. The behavior on the hotfix branch is much better indeed!
Could you briefly describe the tradeoffs being made with the new timer values?
Results on my XPS15 using the new branch:
| | Min (s) | Max (s) | Mean (s) | Median (s) | Stddev (s) | fps |
|---|---|---|---|---|---|---|
| connext | 0.0300700665 | 0.0410900116 | 0.0318081648 | 0.0316700935 | 0.0010073107 | 30.2290620882 |
| opensplice | 0.0100297928 | 0.0511300564 | 0.0221883707 | 0.019359827 | 0.0093266318 | 30.261684248 |
| fastrtps | 0.002010107 | 0.1774599552 | 0.0352770948 | 0.0078701973 | 0.0464236037 | 30.6302342667 |
| fastrtps_hotfix | 0.001853466 | 0.0287704468 | 0.0074550388 | 0.0068049431 | 0.0458932319 | 30.2791350882 |
With the hotfix and for this test, Fast-RTPS has less latency than the other rmw implementations using default values:
Note: I haven't tested other frequencies or message sizes
The branch definitely shows better performance with the burger demo. Thank you for the quick update!
Could you briefly describe the tradeoffs being made with the new timer values?
Yeah, any information what you expect as "side effects" for other use cases would be good to hear.
Hi all,
Regarding this:
Second, once I took the data I post-processed it with a small Python script to calculate the delta between successive publishes/receives (I can share that script later if needed).

Do you see non-trivial jitter? I'm not sure how to quantify that other than gathering a lot of samples and showing the distribution in a log-log plot or something.
Next week we will release as open source a performance measurement tool for various communication systems. Its core features are:
- Logging of performance metrics such as latency, throughput, lost samples and system utilization to a file.
- Creating graphical diagrams from the log files using an extensible python script.
- Easy extensibility to any desired publish/subscribe communication framework.
- Extensive command line interface which allows configuration of the number of publishers and subscribers, quality of service and real-time settings, message type, publishing frequency and more.
- Support for analyzing inter-process, intra-process, and inter-machine communication.
We have been working together with eProsima (FastRTPS) to improve performance under real-time conditions for autonomous driving, with great success.
This will also improve the performance of the upcoming new RMW implementation for FastRTPS with static type support (https://github.com/eProsima/rmw_fastrtps/tree/static_typesupport).
In the following graphs, note that the y-axis is latency in ms and the x-axis is the duration of the experiment in seconds.
Using FastRTPS directly on a real-time system with real-time settings shows good results for medium-sized data at 1000 Hz:
Large data also works well at 30 Hz after applying the settings in http://docs.eprosima.com/en/latest/advanced.html#finding-out-system-maximum-values:
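(For context, the relevant part of that page is raising the kernel's maximum socket buffer sizes; on Linux that is done with sysctl, along these lines, where the value is purely illustrative and the linked page should be consulted for the recommended ones:)

sudo sysctl -w net.core.rmem_max=4194304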
We found for example that busy waiting (meaning not using a waitset) leads to a very significant performance degradation due to expensive mutex locking/unlocking.
Using the current dynamic type support of FastRTPS together with ROS 2 does not provide optimal performance, but this will very likely be resolved shortly by the upcoming new RMW implementation for FastRTPS with static type support.
If you would like to test other RMW implementations, you can do that using the RMW_IMPLEMENTATION environment variable.
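For example, to run the same demo against a specific implementation:

RMW_IMPLEMENTATION=rmw_connext_cpp ros2 run image_tools cam2image -b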
We took care to make the performance test tool real-time capable by:
- Testing it extensively under RT-Linux with proper CPU isolation and memory locking.
- Preallocating all memory.
- Carefully using dedicated RT threads which do not use any blocking calls.
Here is the latency at 100 Hz with real-time functionality enabled:
@richiprosima Can you please respond to the above questions? We would like to get this resolved (merged into master) sooner rather than later. Thanks.
@mikaelarguedas @dirk-thomas I've decreased two QoS values, one for publishers and one for subscribers.
- For publishers: I've decreased publisher_attr.times.nackResponseDelay. It specifies the time a publisher waits before responding to a subscriber's ACKNACK. Its previous value was 45ms, and in the patch it is 5ms. Its purpose is to reduce the number of retries sending the same requested samples to multiple subscribers.
- For subscribers: I've decreased subscriber_attr.times.heartbeatResponseDelay. It specifies the time a subscriber waits before responding to a publisher's heartbeat. Its previous value was 116ms, and in the patch it is 5ms. Its purpose is to reduce the burst of ACKNACKs.
Decreasing these values improves the response to lost data: a lost sample takes less time to be re-sent. This is most noticeable in the video scenario because you can see the stuttering. Why is data lost? A video frame is too large for the default Linux socket buffer size, so some RTPS fragments are lost.
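For illustration, this is roughly where those two attributes live in the Fast-RTPS C++ API. This is a sketch only, not the actual patch (which changes the library defaults rather than per-application attributes); the helper name is hypothetical, and since Duration_t in the Fast-RTPS versions of this era carries seconds plus a fraction in 1/2^32 s units, the exact field values below are assumptions:

```cpp
#include <fastrtps/attributes/PublisherAttributes.h>
#include <fastrtps/attributes/SubscriberAttributes.h>

// Hypothetical helper showing where the two tuned delays are set.
void apply_hotfix_like_delays(
  eprosima::fastrtps::PublisherAttributes & publisher_attr,
  eprosima::fastrtps::SubscriberAttributes & subscriber_attr)
{
  // Delay before a publisher answers a subscriber's ACKNACK (default was ~45 ms).
  publisher_attr.times.nackResponseDelay.seconds = 0;
  publisher_attr.times.nackResponseDelay.fraction = 21474836;  // ~5 ms in 1/2^32 s units (assumed)

  // Delay before a subscriber answers a publisher's heartbeat (default was ~116 ms).
  subscriber_attr.times.heartbeatResponseDelay.seconds = 0;
  subscriber_attr.times.heartbeatResponseDelay.fraction = 21474836;  // ~5 ms (assumed)
}
```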
This week I want to play more with these values and it will be merged at the end of the week.
Could you briefly describe the tradeoffs being made with the new timer values?
I've updated my comment above.
Thanks @richiware for the updated comment!
(sorry if it's obvious) Can you give examples of scenarios where a shorter nackResponseDelay (or heartbeatResponseDelay) would result in worse performance?
As a made-up example of the kind of "scenarios" I'm thinking of: is it possible that in network environments with significant latency, fragments will be considered lost while they may still be on the way, and with a shorter nackResponseDelay we would nack them too early, resulting in an increasing number of resends and collisions?
nackResponseDelay is also used to reduce the number of heartbeat messages. If several ACKNACK messages from different subscribers are received in a short time, they will be answered by one heartbeat instead of one for each ACKNACK.
It's concerning to me that these parameters are fixed times. It would probably help if they were related to network latency somehow so you don't have to tune them manually for the type of network you are using.
This week I want to play more with these values and it will be merged at the end of the week.
@richiware As far as I can see the patch hasn't been merged yet. Can you clarify the timeline when that is planned to happen?
@dirk-thomas, @MiguelCompany and I have been working on a new release. We are integrating all new features, and also this patch, into our internal develop branch. I want to release it to master tomorrow, Wednesday. Sorry for the inconvenience.
With the hotfix merged upstream I will go ahead and close this.
@dirk-thomas for traceability it would be nice to link to the upstream commit where the fix happened.
Commit that changed the timer values: eProsima/Fast-DDS@201a393