gnzlbg/slice_deque

Poor performance of SliceDeque::<u8>::truncate_front()

Closed this issue · 5 comments

I'm using SliceDeque as a queue for incoming binary data (u8) from the network. In a loop, packets arrive and line up in the queue (extend), I parse the data from the packets, and truncate_front the consumed portion of the SliceDeque. It works very nicely in theory, but in practice I've found that truncate_front is slow on Linux, and really sluggish on Windows.

I've created the following simple benchmark app: https://github.com/dvtomas/rust-slicedeque-bench.

Basically, it creates a SliceDeque, then repeatedly extend()s and truncate_front()s it. The number of items to extend and to truncate in each call slowly increases by a configurable amount. All of this can be run in multiple threads, but the performance problem is easy to see even in a single-threaded run.
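The inner loop of the benchmark boils down to roughly the following pattern (a simplified sketch rather than the actual benchmark code; the sizes and increments are illustrative, and I'm assuming truncate_front(len) keeps the last len elements):

```rust
use slice_deque::SliceDeque;

fn main() {
    let mut queue: SliceDeque<u8> = SliceDeque::new();
    let mut grow_size = 1_000usize;   // bytes appended per iteration
    let mut truncate_size = 900usize; // bytes removed per iteration

    for _ in 0..10_000 {
        // A "packet" arrives: append grow_size bytes at the back.
        queue.extend(std::iter::repeat(0u8).take(grow_size));

        // The parsed portion is consumed: remove truncate_size bytes from
        // the front (truncate_front takes the number of elements to keep).
        queue.truncate_front(queue.len().saturating_sub(truncate_size));

        // Both sizes slowly increase by a configurable amount.
        grow_size += 105;
        truncate_size += 100;
    }
}
```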

Here is a log excerpt from a run on my Linux box:

# global_time action thread time[us] grow/truncate_size total_queue_size
5905543 TRUNC 0 4867 216801 11756080
5906024 GROW  0 478 227746 11983826
5910863 TRUNC 0 4834 216901 11766925
5911344 GROW  0 478 227851 11994776
5916198 TRUNC 0 4850 217001 11777775
5916680 GROW  0 478 227956 12005731

The columns are:

  • time since the start of the benchmark [us],
  • action (GROW/TRUNC),
  • thread number (always 0 here),
  • time [us] the action took (probably the most important column),
  • number of bytes we are extending/truncating the queue by,
  • total length of the queue after the operation.

You can see that on Linux, the time to truncate is roughly 10x longer than the time to extend. That's not good, but not horrible. Now let's look at a Windows run with the same parameters:

160042189 TRUNC 0 368593 89101 1987030
160042401 GROW  0 174 93661 2080691
160405897 TRUNC 0 363481 89201 1991490
160406118 GROW  0 170 93766 2085256
160771214 TRUNC 0 365085 89301 1995955
160771411 GROW  0 164 93871 2089826
161143143 TRUNC 0 371722 89401 2000425

(The whole log from a run on my friend's Windows computer can be seen here: https://gist.github.com/dvtomas/2968240b9c9be2ca11dcee63dc5abe60)

It takes 174 us to grow a 2MB queue by ~90kb, but 363 ms to remove ~90kb from the front!

This kind of performance makes SliceDeque impractical for my purposes. Is there anything wrong with the way I'm using it? Is this something that could be solved?

@dvtomas thank you for the report. In general, without something like #33, allocating and re-allocating deques (e.g. due to growing) is currently expected to be very slow - probably the slowest part of the API right now (help with #33 would be appreciated).

However, truncate_front does not modify the allocation, and it should be blazing fast. For your SliceDeque<i32> it just modifies two words: (ptr+=offset, len-=offset).
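(A hypothetical illustration of what that amounts to; the struct and field names below are made up and are not the crate's actual internals.)

```rust
// Made-up, simplified view of a deque header over the mirrored mapping;
// not slice_deque's real layout.
struct RawDeque<T> {
    ptr: *mut T, // first live element
    len: usize,  // number of live elements
}

impl<T: Copy> RawDeque<T> {
    /// Remove `n` elements from the front. For Copy types nothing has to
    /// be dropped, so only these two words change: O(1) regardless of `n`.
    fn remove_front(&mut self, n: usize) {
        let n = n.min(self.len);
        self.ptr = unsafe { self.ptr.add(n) };
        self.len -= n;
    }
}
```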

First question: do you have a significantly simpler reproducer? For example, what's the performance of truncate_front if you allocate the deque with ::with_capacity(2 Mb), write 1 Mb into it, and then truncate_front(90 kb) + insert_front(90 kb) in a loop?
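A minimal sketch of such a reproducer might look like this (sizes, iteration count, and refilling via extend at the back instead of a front insertion are my own choices, not taken from the original benchmark; truncate_front is again assumed to take the number of elements to keep):

```rust
use slice_deque::SliceDeque;
use std::time::Instant;

const CHUNK: usize = 90 * 1024;  // ~90 kb moved per iteration
const FILL: usize = 1024 * 1024; // ~1 Mb kept resident in the deque

fn main() {
    let mut q: SliceDeque<u8> = SliceDeque::with_capacity(2 * 1024 * 1024);
    q.extend(std::iter::repeat(0u8).take(FILL));

    for i in 0..1_000 {
        let t = Instant::now();
        // Remove ~90 kb from the front (keep everything after it).
        q.truncate_front(q.len() - CHUNK);
        let truncate_time = t.elapsed();

        let t = Instant::now();
        // Put ~90 kb back so the resident size stays roughly constant.
        q.extend(std::iter::repeat(0u8).take(CHUNK));
        let grow_time = t.elapsed();

        if i % 100 == 0 {
            println!("truncate: {:?}, grow: {:?}", truncate_time, grow_time);
        }
    }
}
```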

So I have modified the pop_front benchmarks and can reproduce this on OSX. It seems that Rust is not optimizing away the call to drop for your i32 element type, so dropping 90 kb of memory loops over all elements, calling drop once per element.

@dvtomas could you replace your slice_deque dependency with the one from this branch (#70), e.g. by using a git dependency, and test whether that fixes the issue for you?

Just tried that, and it looks perfect. Both on Linux and Windows the truncate_front time is now on the order of single microseconds, even in the debug profile. Thank you so much!

So I've released a new slice_deque version (0.2.4; a cargo update should pick it up). It turns out that Rust wasn't smart enough to realize that i32 does not have a drop impl, and that the loop calling drop for each element could therefore be removed (since it was doing nothing).

The fix was to simply dispatch to the drop method of the sub-slice &mut [from..to] being dropped (both for truncate_front and for truncate_back, which had the same issue). Internally, that method calls mem::needs_drop::<T>(), which returns false if T does not have a Drop::drop method worth calling, and completely skips the loop in that case, essentially turning an O(N) operation into O(1).
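For illustration, a rough sketch of that idea (not the crate's actual code; the function name here is hypothetical):

```rust
use std::{mem, ptr};

/// Sketch only: drop the first `n` elements of `slice` in place, in the
/// spirit of the fixed truncate paths described above.
///
/// Safety: after this call the first `n` elements must not be used again;
/// the caller's (ptr, len) bookkeeping has to exclude them.
unsafe fn drop_front_in_place<T>(slice: &mut [T], n: usize) {
    // The slice's drop glue already performs this check internally; it is
    // spelled out here to make the fast path visible.
    if mem::needs_drop::<T>() {
        // Types with a real Drop impl pay for the per-element loop...
        unsafe { ptr::drop_in_place(&mut slice[..n] as *mut [T]) };
    }
    // ...while for T = u8, i32, etc. this whole function is O(1).
}
```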