KillingSpark/zstd-rs

Improve DecodeBuffer::repeat performance again

KillingSpark opened this issue · 28 comments

With the pull request from @paolobarbolini, draining the decode buffer has become a lot faster (it no longer needs the memmove that was part of Vec::drain(..)). As a consequence, we now have a 2x increase in memcpys in the sequence execution.

It should be possible to get the best of both worlds. I will have to implement a custom ringbuffer though.
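
To illustrate the tradeoff with a sketch (not the actual DecodeBuffer code): draining the front of a Vec has to shift the remaining bytes down to index 0, while a ring buffer only advances a head index, at the cost of letting the stored data wrap around.

let mut buf: Vec<u8> = (0u8..=99).collect();
// Vec-backed buffer: taking the decoded prefix out shifts the remaining
// 36 bytes down to index 0, i.e. a memmove on every drain.
let _drained: Vec<u8> = buf.drain(..64).collect();
assert_eq!(buf.len(), 36);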

Most of the operations it needs to support are pretty trivial:

  1. push()
  2. extend()
  3. len()/capacity()
  4. get_slices()
  5. clear()
  6. reserve()
  7. as_slices()

The only non-trivial part is the optimization we need in the sequence execution (which is implemented in DecodeBuffer::repeat).
We need an equivalent of Vec's extend_from_within, which is a bit harder to do for a ring buffer.

fn extend_from_within(offset, match_length) {
    // This is expected to not need to do anything after the first few calls. 
    // The decodebuffer should grow to a certain capacity
    // at which it can hold all the decoded data until the next drain.
    self.reserve(match_length);

    let (s1, s2) = self.slices();
    let (match1, match2) = figure_out_which_regions_need_copying(offset, match_length, s1, s2);

    let (f1, f2) = self.free_capacity();
    // Now the complicated part that I am too lazy to prototype this evening:
    // copy match1 and match2 into f1 and f2. 
    // With the call to reserve these are guaranteed to have enough space to hold them.
    // In the expected case all that is happening here is a memcpy from match1 into f1 and we are done
    //
    // But since this is a Ringbuffer we need to deal with the occasional fractioning of these regions
    // In the worst case this results in 4 memcpys but that should be quite rare
}
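
For context, here is a minimal sketch of the layout such a ring buffer could use (the field names and the Vec-backed storage are illustrative, not the actual implementation): the stored data lives in at most two contiguous regions, and as_slices has to hand both of them out.

struct RingBuffer {
    buf: Vec<u8>, // fixed-capacity backing storage
    head: usize,  // index of the oldest stored byte
    len: usize,   // number of stored bytes
}

impl RingBuffer {
    /// The occupied bytes as up to two contiguous slices.
    fn as_slices(&self) -> (&[u8], &[u8]) {
        if self.head + self.len <= self.buf.len() {
            // no wrap-around: everything is contiguous
            (&self.buf[self.head..self.head + self.len], &self.buf[0..0])
        } else {
            // wrapped: the data continues at the start of the buffer
            let first = &self.buf[self.head..];
            (first, &self.buf[..self.len - first.len()])
        }
    }
    // free_capacity() would hand out the complementary (up to two) free regions.
}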

I'd like to take a look at this too. It sounds like a great weekend project where I can micro-optimize the hell out of it via flamegraphs and Compiler Explorer 😄

Sure go for it :)

Might also be worth getting some people from the std lib to look at this when you're done. Vec::extend_from_within was motivated by use cases like this crate's. I think they'd accept an implementation of it for VecDeque too.
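
For reference, this is the existing Vec API being talked about; the ring buffer needs an equivalent where both the source and the destination may be split in two:

let mut v = vec![1u8, 2, 3, 4];
// Copies the bytes at indices 1..3 and appends them to the end.
v.extend_from_within(1..3);
assert_eq!(v, [1, 2, 3, 4, 2, 3]);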

https://doc.rust-lang.org/src/alloc/collections/vec_deque/mod.rs.html#288

This is somewhat what I had in mind, but I think it solves an even more general problem. I think we can do something simpler. But as an inspiration for edge cases this is very likely a good source ;)

I think this is implementable completely branchlessly, which would probably be huge!

Maybe guarding against zero-length copy_nonoverlapping calls would be better, but that's for benchmarks on real-world workloads to show.

    // input: which region do we want to copy
    let len: usize = 0;
    let start: usize = 0;

    // data slices in raw parts
    let s1_ptr: *const u8 = std::ptr::null();
    let s1_len: usize = 0;
    let s2_ptr: *const u8 = std::ptr::null();
    let s2_len: usize = 0;

    debug_assert!(len <= s1_len + s2_len);

    // calc the actually wanted slices in raw parts
    let start_in_s1 = usize::min(s1_len, start);
    let end_in_s1 = usize::min(s1_len, start + len);
    let m1_ptr = unsafe { s1_ptr.add(start_in_s1) };
    let m1_len = end_in_s1 - start_in_s1;

    debug_assert!(end_in_s1 <= s1_len);
    debug_assert!(start_in_s1 <= s1_len);

    let start_in_s2 = start.saturating_sub(s1_len);
    let end_in_s2 = start_in_s2 + (len - m1_len);
    let m2_ptr = unsafe { s2_ptr.add(start_in_s2) };
    let m2_len = end_in_s2 - start_in_s2;

    debug_assert!(start_in_s2 <= s2_len);
    debug_assert!(end_in_s2 <= s2_len);

    debug_assert_eq!(len, m1_len + m2_len);

    // the free slices, must hold: f1_len + f2_len >= m1_len + m2_len
    let f1_ptr: *mut u8 = std::ptr::null_mut();
    let f1_len: usize = 0;
    let f2_ptr: *mut u8 = std::ptr::null_mut();
    let f2_len: usize = 0;

    debug_assert!(f1_len + f2_len >= m1_len + m2_len);

    // calc how many from where bytes go where
    let m1_in_f1 = usize::min(m1_len, f1_len);
    let m1_in_f2 = m1_len - m1_in_f1;
    let m2_in_f1 = usize::min(f1_len - m1_in_f1, m2_len);
    let m2_in_f2 = m2_len - m2_in_f1;

    debug_assert_eq!(m1_len, m1_in_f1 + m1_in_f2);
    debug_assert_eq!(m2_len, m2_in_f1 + m2_in_f2);
    debug_assert!(f1_len >= m1_in_f1 + m2_in_f1);
    debug_assert!(f2_len >= m1_in_f2 + m2_in_f2);
    debug_assert_eq!(len, m1_in_f1 + m2_in_f1 + m1_in_f2 + m2_in_f2);

    // m1 can only spill into f2 if f1 is already completely filled by m1,
    // so m1_in_f2 and m2_in_f1 can never both be positive
    debug_assert!(!(m1_in_f2 > 0 && m2_in_f1 > 0));

    unsafe {
        f1_ptr.copy_from_nonoverlapping(m1_ptr, m1_in_f1);
        f1_ptr
            .add(m1_in_f1)
            .copy_from_nonoverlapping(m2_ptr, m2_in_f1);

        f2_ptr.copy_from_nonoverlapping(m1_ptr.add(m1_in_f1), m1_in_f2);
        f2_ptr
            .add(m1_in_f2)
            .copy_from_nonoverlapping(m2_ptr.add(m2_in_f1), m2_in_f2);
    }

If I had to guess, it's faster to check that the copy isn't zero-length than to incur the cost of calling the function, but we'll see. I'll try to find some time to work on it.

The function below seems to be compiled into a jump table, so coming up with a small integer for each case would be enough to stay branchless and get fewer memcpy invocations!

This could probably be converted to target even smaller integers because c2 and c3 cannot both be 1, but having a range of [0..16) should be small enough. And being smarter would probably involve branching, which is what we are trying to avoid ^^

let c1 = m1_in_f1 > 0;
let c2 = m2_in_f1 > 0;
let c3 = m1_in_f2 > 0;
let c4 = m2_in_f2 > 0;
let case = (c1 as usize) | ((c2 as usize) << 1) | ((c3 as usize) << 2) | ((c4 as usize) << 3);

Then we pick the ones that are possible in a match and mark all others as unreachable to the compiler.

Obviously this is just one of the options we need to benchmark. If a more readable version has similar performance I'd gladly use that instead.

pub fn is_jump_table(
    x1: *mut usize, x2: *mut usize, x3: *mut usize,
    y1: *const usize, y2: *const usize, y3: *const usize,
    len: usize, len2: usize, len3: usize,
    case: usize,
) {
    unsafe {
        match case {
            0 => {}
            1 => {
                x1.copy_from_nonoverlapping(y1, len);
            }
            2 => {
                x1.copy_from_nonoverlapping(y2, len);
                x2.copy_from_nonoverlapping(y3, len2);
            }
            3 => {
                x1.copy_from_nonoverlapping(y3, len);
                x3.copy_from_nonoverlapping(y1, len2);
            }
            16 => {
                x1.copy_from_nonoverlapping(y1, len);
                x2.copy_from_nonoverlapping(y2, len2);
                x3.copy_from_nonoverlapping(y3, len3);
            }
            _ => { std::hint::unreachable_unchecked() }
        }
    }
}

Considering how good the std VecDeque is, I wonder how long it'd take to get VecDeque::extend_from_within accepted as an unstable API and subsequently stabilized 🤔. Anyway, I've started improving the implementation and I'll send a PR in the coming days.

I've sent rust-lang/rust#95904 to start improving VecDeque so that one day we'll be able to go back to it instead of having to rely on our own ring buffer forever. The improvements are very big, so it should reduce the gap caused by the missing extend_from_within implementation.

Nice! Yeah, I think it would be nice to use VecDeque once it has this API. I don't think it will take that long. It is pretty clear what the API needs to do, so stabilization should be fast.

Still fun to tinker with this stuff tho :)

I hate when I come up with clever things and benchmarks say no :(

Anyways, at least on my machine and decoding enwik9, just doing the checks is the fastest option.

Coming in at about 5.7x the time the real zstd needs. That program is irritatingly well optimized.

Yeah, #[inline(always)] is usually a bad idea (I think I saw you adding those in the GH feed). On the other hand, #[cold] and #[inline(never)] can help a lot sometimes 😃
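
A small illustration of that kind of hint (the function is made up for the example, not taken from the crate): keeping a rarely-taken slow path out of the hot loop.

// Illustrative only: the growth path runs rarely, so telling the compiler not
// to inline it keeps the hot copy loop small.
#[cold]
#[inline(never)]
fn grow_buffer(buf: &mut Vec<u8>, additional: usize) {
    buf.reserve(additional);
}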

Yeah at this point I am just playing around.

What worries me is that:

A) There are still bugs in this code
B) This is less performant than the version implemented with Vec (at least on my machine, with enwik9: ~5.5x vs ~4.6x slower)

This is less performant than the version implemented with Vec (at least on my machine, with enwik9: ~5.5x vs ~4.6x slower)

I think part of the blame might be the fact that you make a new allocation instead of extending the existing one.
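
As a sketch of the difference (assuming a manually managed allocation; the names are illustrative): std::alloc::realloc at least gives the allocator a chance to extend in place, whereas alloc + copy + dealloc always pays for the copy. For a ring buffer the wrapped part of the data still has to be moved after growing, but the bulk of the bytes can stay where they are.

use std::alloc::{realloc, Layout};

// Illustrative: grow the existing raw allocation instead of allocating a new
// buffer and copying everything over. `old_layout` describes the current allocation.
unsafe fn grow(ptr: *mut u8, old_layout: Layout, new_size: usize) -> *mut u8 {
    realloc(ptr, old_layout, new_size)
}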

Before we forget it: this returns a null pointer if there's an error

let new_buf = unsafe { std::alloc::alloc(new_layout) };

Before we forget it: this returns a null pointer if there's an error

That is checked on the line below. But you are right, this should either panic or return an error when new_buf == ptr::null_mut().
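
A sketch of how that check could look (new_capacity is a placeholder for whatever size the buffer is growing to): bail out through the standard allocation-error handler instead of continuing with a null pointer.

use std::alloc::{alloc, handle_alloc_error, Layout};

let new_layout = Layout::array::<u8>(new_capacity).expect("capacity overflow");
// Assumes new_capacity > 0; alloc on a zero-sized layout is not allowed.
let new_buf = unsafe { alloc(new_layout) };
if new_buf.is_null() {
    // Diverges via the registered handler (aborts by default).
    handle_alloc_error(new_layout);
}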

FYI smoltcp has a working ring buffer implementation, too: https://github.com/smoltcp-rs/smoltcp/blob/master/src/storage/ring_buffer.rs

For anyone interested, work on this is happening in the ringbuffer branch

The VecDeque::extend optimization is moving forward (rust-lang/rust#95904); I hope it lands in 1.62. The benchmarks are promising.

EDIT: it will be in tomorrow's nightly (2022-04-28)

That's just for the extend, not the extend_from_within operation right?

That's just for the extend, not the extend_from_within operation right?

Right

But it does make VecDeque::extend actually do a memcpy instead of a very slow, unoptimized byte-by-byte copy.
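
As a trivial usage sketch (decoded_block is just illustrative data): the call doesn't change, only the codegen behind it does.

use std::collections::VecDeque;

let decoded_block = vec![0u8; 4096]; // illustrative data
let mut out: VecDeque<u8> = VecDeque::new();
// Previously this went through the generic per-item Extend impl;
// with rust-lang/rust#95904 it is done with bulk copies instead.
out.extend(decoded_block);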

Benchmarks on a big file with a high compression ratio:

| zstd-rs commit | nightly version | result | time spent doing memcpy |
| --- | --- | --- | --- |
| f604d48 (Vec) | 2022-04-28 | 6.5s | 77% |
| 66d02e4 (VecDeque) | 2022-04-27 | 2.8s | 40% (including all of the time spent in the non-optimized Extend impl) |
| 66d02e4 (VecDeque) | 2022-04-28 (thanks to rust-lang/rust#95904) | 2.2s | 22% |
| 25cf7a6 #23 (VecDeque) | 2022-04-28 | 2.18s | 22% |
| 30e65d7 (custom ringbuffer) | 2022-04-28 | 1.8s | 27% |
| zstd 1.5.2 | - | 0.5s | Untested |

That's very nice! I am actually a bit surprised that VecDeque did not have this optimization yet. I think you did a lot of people a big favor with that.

Nonetheless I'd say your benchmarks speak towards keeping the custom ringbuffer. A 1.2x speedup is pretty significant in my opinion. Do you plan on doing an extend_from_within for VecDeque?

Do you plan on doing an extend_from_within for VecDeque?

Yes. I've already looked into it and it doesn't seem like the implementation would be too hard.

I'm trying to see if there's a way of doing it without having to implement extend_from_within in VecDeque, by relying on LLVM to optimize away the useless intermediate memcpys, but it doesn't work.
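
A sketch of the workaround idea (the names are illustrative, not the crate's code): gather the match into a scratch Vec, then extend the VecDeque from it, hoping LLVM elides the temporary copy. In practice it does not reliably do so.

use std::collections::VecDeque;

fn repeat_via_scratch(buffer: &mut VecDeque<u8>, start: usize, len: usize) {
    let (s1, s2) = buffer.as_slices();
    let mut tmp = Vec::with_capacity(len);
    // Gather the `len` bytes starting at logical index `start` from the two slices.
    for i in start..start + len {
        tmp.push(if i < s1.len() { s1[i] } else { s2[i - s1.len()] });
    }
    // The hope was that this extra copy gets optimized away; it doesn't.
    buffer.extend(tmp);
}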

This got me really curious about how the memcpy optimizer works in LLVM. I'll take a look at https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp

I may have found a way to convince LLVM, if only it reliably eliminated memcpys of uninit memory 😞

https://rust.godbolt.org/z/sef536cPn

This is a pretty interesting thing: https://github.com/gnzlbg/slice_deque/

I don't think we should use that (because they seem to rely heavily on Linux syscalls), but I thought anyone interested in this issue might also be interested in that crate. It looks very cool. I might play with it out of interest and see how it performs.

I merged the work from the ringbuffer branch into master. It will need more testing before I feel comfortable making a new release with it, but I think (at least for now) this is the way to go.