rust-lang/rust

Unnecessary copy when constructing arrays from returned arrays?

bwesterb opened this issue

I want to construct a big array by concatenating smaller arrays returned by other functions. As a simple example:

pub fn g(f: &Fn() -> [u64;40]) -> [u64;80] {
    let a = f();
    let b = f();
    let mut ret = [0u64;80];
    ret[0..40].copy_from_slice(&a[..]);
    ret[40..80].copy_from_slice(&b[..]);
    ret
}

Rust nightly generates a call to memcpy.

Is there a way to prevent this memcpy? Or am I missing an obvious other way to write this function?

Of course one could rewrite the called function f to take a &mut [u64] instead of returning the array, but that removes the compile-time check on the length and introduces bounds checks. Using &mut [u64;40] as an "out" argument solves that problem, but then I don't see a safe way to get two &mut [u64;40] pointing into a [u64;80] without using transmute.
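A minimal sketch of a safe split, using the standard library's TryFrom impls from &mut [T] to &mut [T; N] (split_80 is a made-up helper name; the unwraps only materialize the length check):

use std::convert::TryInto;

// Hypothetical helper: borrow the two halves of an 80-element buffer
// as fixed-size array references, without transmute.
fn split_80(buf: &mut [u64; 80]) -> (&mut [u64; 40], &mut [u64; 40]) {
    // split_at_mut yields two disjoint mutable slices; TryFrom then
    // converts each into a &mut [u64; 40], failing only on a length
    // mismatch, which cannot happen here.
    let (lo, hi) = buf.split_at_mut(40);
    (lo.try_into().unwrap(), hi.try_into().unwrap())
}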

(Background: I'm implementing the XMSSMT hash-based signature scheme in Rust, which involves concatenating lots of hashes. The usual Rust hash library returns an array (actually a GenericArray) instead of taking a &mut [u64;...] out-parameter, which led me to believe that the copy could be optimised away.)

What you're looking for is commonly known as "named return value optimization" (NRVO). It's currently not implemented in Rust, but progress is being tracked in #32966.
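For readers unfamiliar with the term, this is the pattern NRVO targets (a generic illustration, not code from this issue):

// `ret` is a *named* local returned by value. With NRVO the function
// would construct it directly in the caller's return slot; without it,
// the return can involve a 640-byte memcpy from the local.
fn make() -> [u64; 80] {
    let mut ret = [0u64; 80];
    ret[0] = 1;
    ret
}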

In the meantime, if you are willing to take an out-param (&mut [u64; 80]), it's not too hard to carve out array references of arbitrary size at arbitrary offsets. I've used arrayref for this purpose in the (distant) past. You would also want to change your function parameter to use an out-param (f: impl Fn(&mut [u64; 40])).
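A sketch of how the out-param version of g could look, assuming the arrayref crate (the function shapes are illustrative, not from the issue):

use arrayref::array_mut_ref;

// `f` now fills a caller-provided &mut [u64; 40] instead of returning
// the array by value, so no temporary is needed for its result.
pub fn g(f: impl Fn(&mut [u64; 40])) -> [u64; 80] {
    let mut ret = [0u64; 80];
    // array_mut_ref!(buf, offset, len) carves a &mut [u64; len] out of
    // `ret`, with the length fixed at compile time.
    f(array_mut_ref!(ret, 0, 40));
    f(array_mut_ref!(ret, 40, 40));
    ret
}

With this shape, f writes straight into ret, so there is no temporary left to memcpy.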

nikic commented

LLVM isn't very good at optimizing away memcpys. I think it would be possible to eliminate one of the memcpys by extending call slot optimization to work with a destination that is a GEP of an alloca.

nikic commented

Together with a few other changes, https://reviews.llvm.org/D89623 will remove the last memcpy in g1 and g4, as well as one of the memcpys in g3.

g2 would require mutable noalias to be re-enabled.

nikic commented

On beta we generate 1, 2, 2, 1 memcpys (for g1 through g4 respectively). On nightly we generate 1, 2, 1, 0 memcpys. With -Z mutable-noalias we generate 1, 0, 1, 0 memcpys.

Not sure why the one memcpy in the first function is still there. It gets dropped when I run -memcpyopt on the final IR, so this is presumably a phase-ordering issue.

nikic commented

For g the IR after -memcpyopt is:

; Function Attrs: nounwind nonlazybind
define void @_ZN6test111g17h45fb2453d8cc5713E([80 x i64]* noalias nocapture sret([80 x i64]) dereferenceable(640) %0, {}* nonnull align 1 %1, [3 x i64]* noalias nocapture readonly align 8 dereferenceable(24) %2) unnamed_addr #0 {
  %4 = alloca [40 x i64], align 8
  %5 = alloca [40 x i64], align 8
  %6 = bitcast [40 x i64]* %5 to i8*
  call void @llvm.lifetime.start.p0i8(i64 320, i8* nonnull %6)
  %7 = getelementptr inbounds [3 x i64], [3 x i64]* %2, i64 0, i64 3
  %8 = bitcast i64* %7 to void ([40 x i64]*, {}*)**
  %9 = load void ([40 x i64]*, {}*)*, void ([40 x i64]*, {}*)** %8, align 8, !invariant.load !2, !nonnull !2
  %10 = bitcast [80 x i64]* %0 to [40 x i64]*
  call void %9([40 x i64]* noalias nocapture nonnull sret([40 x i64]) dereferenceable(320) %10, {}* nonnull align 1 %1) #5
  %11 = bitcast [40 x i64]* %4 to i8*
  call void @llvm.lifetime.start.p0i8(i64 320, i8* nonnull %11)
  call void %9([40 x i64]* noalias nocapture nonnull sret([40 x i64]) dereferenceable(320) %4, {}* nonnull align 1 %1) #5
  %12 = bitcast [80 x i64]* %0 to i8*
  %13 = getelementptr i8, i8* %12, i64 320
  call void @llvm.memset.p0i8.i64(i8* align 8 %13, i8 0, i64 320, i1 false)
  %14 = getelementptr inbounds [80 x i64], [80 x i64]* %0, i64 0, i64 40
  %15 = bitcast i64* %14 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(320) %15, i8* nonnull align 8 dereferenceable(320) %11, i64 320, i1 false) #5
  call void @llvm.lifetime.end.p0i8(i64 320, i8* nonnull %11)
  call void @llvm.lifetime.end.p0i8(i64 320, i8* nonnull %6)
  ret void
}

The problem is that we have a clobbering memset (the zero-initialization of ret) which is dead, since the following memcpy fully overwrites it, but it will only be eliminated by the later DSE pass, after MemCpyOpt has already run.

nikic commented

The memcpy in the first example should be eliminated by llvm/llvm-project@9080444 and llvm/llvm-project@2902bde (possibly the first one isn't needed) by side-stepping the phase ordering issue. MemCpyOpt happened to implement partial DSE for this particular case already.

nikic commented

After #87570 we're now down to 0, 0, 1, 0 memcpys.

nikic commented

I think this is about as fixed as it's going to get, but it may be worthwhile to add some codegen tests for it.
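A sketch of what such a test might look like, following the FileCheck convention of rustc's codegen test suite (the function name and CHECK lines are guesses, not the tests that eventually landed):

// compile-flags: -O
#![crate_type = "lib"]

// CHECK-LABEL: @concat_returned_arrays
// CHECK-NOT: call void @llvm.memcpy
#[no_mangle]
pub fn concat_returned_arrays(f: fn() -> [u64; 40]) -> [u64; 80] {
    let a = f();
    let b = f();
    let mut ret = [0u64; 80];
    ret[0..40].copy_from_slice(&a);
    ret[40..80].copy_from_slice(&b);
    ret
}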

faptc commented

This has now regressed in the first three cases: https://godbolt.org/z/W6ExP7hdc
@rustbot label -E-needs-test

nikic commented

I believe the regression in g2 is actually a correctness fix, in that we don't have consensus that the optimization would be valid (cc @RalfJung for another example where we would need "spurious stores" through &mut to be allowed for an optimization). Previously this worked only due to an LLVM bug: call slot optimization failed to consider whether introducing spurious stores is valid.
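To illustrate the concern, here is a contrived sketch (not from this thread) of how a spurious store through a &mut could become observable across an unwind:

use std::panic::{catch_unwind, AssertUnwindSafe};

fn fill(out: &mut [u64; 2]) {
    out[0] = 1;
    panic!("interrupted before out[1] is written");
}

fn caller(dest: &mut [u64; 2]) {
    let mut tmp = [0u64; 2];
    fill(&mut tmp); // unwinds here; `dest` is never touched
    *dest = tmp;    // skipped during unwinding
}

fn main() {
    let mut buf = [7u64; 2];
    let _ = catch_unwind(AssertUnwindSafe(|| caller(&mut buf)));
    // Source semantics: the temporary shields `buf` from the partial
    // write. If call slot optimization forwarded `buf` into `fill`
    // directly, `buf` would now be [1, 7] -- a store through the &mut
    // that the original program never performed.
    assert_eq!(buf, [7, 7]);
}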

RalfJung commented

What's the g2 you are referring to? Some comments above also mention g1 and g4, but the issue description only has a g. Does the issue description need updating?

Anyway I added this to rust-lang/unsafe-code-guidelines#133.

nikic commented

@RalfJung It refers to the godbolt links, e.g. https://godbolt.org/z/W6ExP7hdc.

RalfJung commented

Okay, so the issue description is extremely outdated then. That's quite confusing when coming to this issue anew.