Newtype impedes vectorization
bluss opened this issue · 9 comments
In the following example, the newtype is not zero-cost in practice since it seems to impede optimizations in llvm. The plain u32 sum vectorizes while Foo(u32) does not. The newtype fold needs 3 times the runtime of the plain u32 fold.
rustc version: rustc 1.1.0-nightly (97d4e76 2015-04-27) (built 2015-04-28)
code (playpen link)
(The code has been updated)
#![crate_type="lib"]
#![feature(test)]
extern crate test;
#[inline(never)]
pub fn folds(x: &[u32]) -> u32 { x.iter().fold(0, |a, &b| a + b) }
#[derive(Copy, Clone)]
pub struct Foo<T>(T);
#[inline(never)]
pub fn folds_foo(x: &[Foo<u32>]) -> Foo<u32> { x.iter().fold(Foo(0), |a, &b| Foo(a.0 + b.0)) }
#[bench]
fn folds1(b: &mut test::Bencher)
{
let xs = test::black_box(vec![1; 1024]);
b.iter(|| {
folds(&xs)
})
}
#[bench]
fn folds2(b: &mut test::Bencher)
{
let xs = test::black_box(vec![Foo(1); 1024]);
b.iter(|| {
folds_foo(&xs)
})
}
bench results vary with compilation setting
// rustc -C opt-level=3 --test
running 2 tests
test folds1 ... bench: 206 ns/iter (+/- 5)
test folds2 ... bench: 609 ns/iter (+/- 6)
// rustc -C opt-level=3 -C target-cpu=corei7-avx --test
running 2 tests
test folds1 ... bench: 131 ns/iter (+/- 1)
test folds2 ... bench: 192 ns/iter (+/- 3)
That second one should be |a, &b|
which is slightly faster but still 3x slower on my machine. Destructuring (|Foo(a), &Foo(b)| Foo(a + b)
) is the same as |a, b|
and slower than |a, &b|
. Also, getting rid of the generic doesn't help.
I agree, that's what I was intending @Stebalien, but it shouldn't affect codegen either. When I tried it doesn't affect benchmarks. Vectorization is cool btw, compiling with corei7-avx decreases the plain u32 fold's runtime even more, increasing the benchmark difference. :-)
Updated code & bench.
Ah goddammit, I actually wrote a patch that unwrapped newtypes, at least for simple cases (which includes this one), but abandoned it because I couldn't see a noticable difference in optimisation/performance.
Not vectorizing isn't actually correct. It just does it less efficiently. That's some puzzle to try to understand.
Triage: Still an issue with rustc 1.9.0-nightly (98f0a9128 2016-03-23)
I feel like this is fixed judging by the bench results below. Please reopen if that's not the case, preferably with a summary of what we should be looking for to close this issue.
Without target-cpu
:
$ ./test --bench
running 2 tests
test folds1 ... bench: 44 ns/iter (+/- 8)
test folds2 ... bench: 45 ns/iter (+/- 2)
test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured
and with:
$ rustc -Copt-level=3 -C target-cpu=corei7-avx --test test.rs
$ ./test --bench
running 2 tests
test folds1 ... bench: 40 ns/iter (+/- 0)
test folds2 ... bench: 40 ns/iter (+/- 0)
test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured
Nice! I can confirm that too.
According to playpen right now, not fixed in stable not fixed in rustc 1.18.0-beta.1 (4dce67253 2017-04-25)
But it is fixed in rustc 1.19.0-nightly (777ee2079 2017-05-01)