Unnecessary copy when constructing arrays from returned arrays?
bwesterb opened this issue · 13 comments
I want to construct a big array by concatenating smaller arrays returned by other functions. As a simple example:
pub fn g(f: &dyn Fn() -> [u64; 40]) -> [u64; 80] {
    let a = f();
    let b = f();
    let mut ret = [0u64; 80];
    ret[0..40].copy_from_slice(&a[..]);
    ret[40..80].copy_from_slice(&b[..]);
    ret
}
Rust nightly generates a call to memcpy.
Is there a way to prevent this memcpy? Am I missing an obvious other way to write this function?
Of course one could rewrite the called function f to take a &mut [u64] instead of returning the array, but that removes the compile-time check on the length and introduces bounds checks. Using &mut [u64; 40] as an "out" argument solves that problem, but then I don't see a safe way to get two &mut [u64; 40] references into a [u64; 80] without using transmute.
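(As an aside, there is a safe way to obtain two fixed-size mutable views into the larger array without transmute, at the cost of a checked conversion. The sketch below is illustrative only and not code from this issue; the helper name split_halves is made up. It relies on split_at_mut plus the standard TryFrom impl for mutable array references.)

// Sketch: a safe, transmute-free way to get two &mut [u64; 40] views into a
// [u64; 80]. The unwraps cannot fail because the lengths are constants, but
// the length checks are still emitted unless the optimizer removes them.
use core::convert::TryInto;

fn split_halves(buf: &mut [u64; 80]) -> (&mut [u64; 40], &mut [u64; 40]) {
    let (lo, hi) = buf.split_at_mut(40);
    (lo.try_into().unwrap(), hi.try_into().unwrap())
}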
(Background: I'm implementing the XMSSMT hash-based signature scheme in Rust, which involves concatenating lots of hashes. The usual Rust hash library returns an array (actually a GenericArray) instead of using a &mut [u64; ...] parameter, which led me to believe that the copy could be optimised away.)
What you're looking for is commonly known as "named return value optimization". It's currently not implemented in Rust, but progress is being tracked in #32966.
In the meantime, if you are willing to take an out-param (&mut [u64; 80]), it's not too hard to convert it to an array reference with an arbitrary size and offset. I've used arrayref for this purpose in the (distant) past. You would also want to change your function parameter to use an out-param (f: impl Fn(&mut [u64; 40])).
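A rough sketch of that out-param shape, assuming the arrayref crate (e.g. arrayref = "0.3" in Cargo.toml) and its array_mut_ref! macro; the function name g_out is made up for illustration:

// Sketch only: out-param version of g. array_mut_ref!(buf, offset, len)
// yields a &mut [u64; len] into `ret`, so the callee writes directly into
// the caller's buffer and no temporaries need to be stitched together.
use arrayref::array_mut_ref;

pub fn g_out(f: impl Fn(&mut [u64; 40]), ret: &mut [u64; 80]) {
    f(array_mut_ref!(ret, 0, 40));
    f(array_mut_ref!(ret, 40, 40));
}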
LLVM isn't very good at optimizing away memcpys. I think it would be possible to eliminate one of them by extending call slot optimization to work with a destination that is a GEP of an alloca.
Together with a few other changes, https://reviews.llvm.org/D89623 will remove the last memcpy in g1 and g4. It also removes one of the memcpys in g3.
g2 would require mutable noalias to be reenabled.
On beta we generate 1, 2, 2, 1 memcpys. On nightly we generate 1, 2, 1, 0 memcpys. With -Z mutable-noalias we generate 1, 0, 1, 0 memcpys.
Not sure why the one memcpy in the first function is still there. It gets dropped when I run -memcpyopt on the final IR, so presumably it's a phase-ordering issue.
For g the IR after -memcpyopt is:
; Function Attrs: nounwind nonlazybind
define void @_ZN6test111g17h45fb2453d8cc5713E([80 x i64]* noalias nocapture sret([80 x i64]) dereferenceable(640) %0, {}* nonnull align 1 %1, [3 x i64]* noalias nocapture readonly align 8 dereferenceable(24) %2) unnamed_addr #0 {
%4 = alloca [40 x i64], align 8
%5 = alloca [40 x i64], align 8
%6 = bitcast [40 x i64]* %5 to i8*
call void @llvm.lifetime.start.p0i8(i64 320, i8* nonnull %6)
%7 = getelementptr inbounds [3 x i64], [3 x i64]* %2, i64 0, i64 3
%8 = bitcast i64* %7 to void ([40 x i64]*, {}*)**
%9 = load void ([40 x i64]*, {}*)*, void ([40 x i64]*, {}*)** %8, align 8, !invariant.load !2, !nonnull !2
%10 = bitcast [80 x i64]* %0 to [40 x i64]*
call void %9([40 x i64]* noalias nocapture nonnull sret([40 x i64]) dereferenceable(320) %10, {}* nonnull align 1 %1) #5
%11 = bitcast [40 x i64]* %4 to i8*
call void @llvm.lifetime.start.p0i8(i64 320, i8* nonnull %11)
call void %9([40 x i64]* noalias nocapture nonnull sret([40 x i64]) dereferenceable(320) %4, {}* nonnull align 1 %1) #5
%12 = bitcast [80 x i64]* %0 to i8*
%13 = getelementptr i8, i8* %12, i64 320
call void @llvm.memset.p0i8.i64(i8* align 8 %13, i8 0, i64 320, i1 false)
%14 = getelementptr inbounds [80 x i64], [80 x i64]* %0, i64 0, i64 40
%15 = bitcast i64* %14 to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(320) %15, i8* nonnull align 8 dereferenceable(320) %11, i64 320, i1 false) #5
call void @llvm.lifetime.end.p0i8(i64 320, i8* nonnull %11)
call void @llvm.lifetime.end.p0i8(i64 320, i8* nonnull %6)
ret void
}
The problem is that we have a clobbering memset which is dead, but it will only be eliminated by the following DSE pass.
The memcpy in the first example should be eliminated by llvm/llvm-project@9080444 and llvm/llvm-project@2902bde (possibly the first one isn't needed) by side-stepping the phase ordering issue. MemCpyOpt happened to implement partial DSE for this particular case already.
I think this is about as fixed as it's going to get, but it may be worthwhile to add some codegen tests for it.
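A codegen test for this could look roughly like the sketch below, in the style of rustc's FileCheck-based tests/codegen suite (the exact directive syntax and header comments vary between toolchain versions, and the test name is made up):

// Sketch of a codegen test; assumes the usual FileCheck conventions.
// compile-flags: -O
#![crate_type = "lib"]

// CHECK-LABEL: @concat_returned_arrays
// CHECK-NOT: call void @llvm.memcpy
#[no_mangle]
pub fn concat_returned_arrays(f: &dyn Fn() -> [u64; 40]) -> [u64; 80] {
    let a = f();
    let b = f();
    let mut ret = [0u64; 80];
    ret[0..40].copy_from_slice(&a[..]);
    ret[40..80].copy_from_slice(&b[..]);
    ret
}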
This now regresses in the first 3 cases: https://godbolt.org/z/W6ExP7hdc
@rustbot label -E-needs-test
I believe the regression in g2 is actually a correctness fix, in that we don't have consensus that the optimization would be valid (cc @RalfJung for another example where we need "spurious store" on &mut for optimization). Previously this worked due to an LLVM bug, which failed to consider whether introducing spurious stores is valid for call slot optimization.
What's the g2 you are referring to? Some comment above also mentions a g1 and a g4, but the issue description only has a g. Does the issue description need updating?
Anyway I added this to rust-lang/unsafe-code-guidelines#133.
@RalfJung It refers to the godbolt links, e.g. https://godbolt.org/z/W6ExP7hdc.
Okay, so the issue description is extremely outdated then. That's quite confusing when coming to this issue anew.