rust-lang/rust

Newtype impedes vectorization

bluss opened this issue · 9 comments

bluss commented

In the following example, the newtype is not zero-cost in practice, since it seems to impede optimizations in LLVM. The plain u32 sum vectorizes while the Foo(u32) sum does not. The newtype fold takes about 3 times as long as the plain u32 fold.
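For context on what "zero-cost" means here, a minimal sketch follows. It assumes the same Foo shape as the code in this report; the helper name same_layout_as_u32 is made up for illustration. It only checks that the wrapper has the same size and alignment as its field, which is what makes the runtime difference surprising. It says nothing about how LLVM optimizes the loop.

```rust
use std::mem::{align_of, size_of};

// Same newtype shape as in the report (field made `pub` for illustration).
#[derive(Copy, Clone)]
pub struct Foo<T>(pub T);

/// Hypothetical helper: true when `Foo<u32>` has the same size and
/// alignment as a bare `u32` on the current target.
pub fn same_layout_as_u32() -> bool {
    size_of::<Foo<u32>>() == size_of::<u32>()
        && align_of::<Foo<u32>>() == align_of::<u32>()
}
```

If this returns true, any slowdown has to come from the optimizer and not from extra data being moved around.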

rustc version: rustc 1.1.0-nightly (97d4e76 2015-04-27) (built 2015-04-28)

code (playpen link)

(The code has been updated)

#![crate_type="lib"]
#![feature(test)]
extern crate test;

#[inline(never)]
pub fn folds(x: &[u32]) -> u32 { x.iter().fold(0, |a, &b| a + b) }

#[derive(Copy, Clone)]
pub struct Foo<T>(T);

#[inline(never)]
pub fn folds_foo(x: &[Foo<u32>]) -> Foo<u32> { x.iter().fold(Foo(0), |a, &b| Foo(a.0 + b.0)) }

#[bench]
fn folds1(b: &mut test::Bencher)
{
    let xs = test::black_box(vec![1; 1024]);
    b.iter(|| {
        folds(&xs)
    })
}

#[bench]
fn folds2(b: &mut test::Bencher)
{
    let xs = test::black_box(vec![Foo(1); 1024]);
    b.iter(|| {
        folds_foo(&xs)
    })
}

Bench results vary with compilation settings:

// rustc -C opt-level=3 --test

running 2 tests
test folds1 ... bench:       206 ns/iter (+/- 5)
test folds2 ... bench:       609 ns/iter (+/- 6)

// rustc -C opt-level=3 -C target-cpu=corei7-avx --test
running 2 tests
test folds1 ... bench:       131 ns/iter (+/- 1)
test folds2 ... bench:       192 ns/iter (+/- 3)

Stebalien commented

That second one should be |a, &b| which is slightly faster but still 3x slower on my machine. Destructuring (|Foo(a), &Foo(b)| Foo(a + b)) is the same as |a, b| and slower than |a, &b|. Also, getting rid of the generic doesn't help.
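The closure variants mentioned in this comment can be written out side by side, roughly as below. This is a sketch for comparison only: the function names and the non-generic Bar type are made up here, and the extra Debug and PartialEq derives are just so the results can be compared. All variants compute the same sum; the comment above is only about how fast each one runs.

```rust
#[derive(Copy, Clone, Debug, PartialEq)]
pub struct Foo<T>(pub T);

// `|a, &b|`: the element is copied out of the reference in the pattern.
pub fn fold_copy(x: &[Foo<u32>]) -> Foo<u32> {
    x.iter().fold(Foo(0), |a, &b| Foo(a.0 + b.0))
}

// `|a, b|`: the element stays a reference (`b: &Foo<u32>`).
pub fn fold_ref(x: &[Foo<u32>]) -> Foo<u32> {
    x.iter().fold(Foo(0), |a, b| Foo(a.0 + b.0))
}

// Destructuring both the accumulator and the element in the closure arguments.
pub fn fold_destructure(x: &[Foo<u32>]) -> Foo<u32> {
    x.iter().fold(Foo(0), |Foo(a), &Foo(b)| Foo(a + b))
}

// Non-generic newtype (hypothetical name), to rule out the type parameter.
#[derive(Copy, Clone, Debug, PartialEq)]
pub struct Bar(pub u32);

pub fn fold_non_generic(x: &[Bar]) -> Bar {
    x.iter().fold(Bar(0), |a, &b| Bar(a.0 + b.0))
}
```

Marking each function #[inline(never)], as in the original code, keeps them separate when comparing the generated code.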

bluss commented

I agree, that's what I was intending @Stebalien, but it shouldn't affect codegen either. When I tried it, it didn't affect the benchmarks. Vectorization is cool btw: compiling with corei7-avx decreases the plain u32 fold's runtime even more, increasing the benchmark difference. :-)

bluss commented

Updated code & bench.

Aatch commented

Ah goddammit, I actually wrote a patch that unwrapped newtypes, at least for simple cases (which includes this one), but abandoned it because I couldn't see a noticeable difference in optimisation/performance.

bluss commented

Saying that it doesn't vectorize isn't actually correct. It does vectorize, just less efficiently. That's some puzzle to try to understand.

bluss commented

cc @dotdash if you have time & interest

bluss commented

Triage: Still an issue with rustc 1.9.0-nightly (98f0a9128 2016-03-23)

I feel like this is fixed judging by the bench results below. Please reopen if that's not the case, preferably with a summary of what we should be looking for to close this issue.

Without target-cpu:

$ ./test --bench

running 2 tests
test folds1 ... bench:          44 ns/iter (+/- 8)
test folds2 ... bench:          45 ns/iter (+/- 2)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

and with:

$ rustc -Copt-level=3 -C target-cpu=corei7-avx --test test.rs
$ ./test --bench

running 2 tests
test folds1 ... bench:          40 ns/iter (+/- 0)
test folds2 ... bench:          40 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

bluss commented

Nice! I can confirm that too.

According to playpen right now, this is not fixed in stable and not fixed in rustc 1.18.0-beta.1 (4dce67253 2017-04-25).

But it is fixed in rustc 1.19.0-nightly (777ee2079 2017-05-01).
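For anyone re-checking this on a stable toolchain, where #[bench] and the test crate are not available, a rough timing sketch follows. It assumes std::hint::black_box (stable in recent Rust releases) and uses std::time::Instant in place of the bench harness. The helpers time_it and compare are hypothetical names, and the numbers will be noisier than the bench output above.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

#[derive(Copy, Clone, Debug, PartialEq)]
pub struct Foo<T>(pub T);

#[inline(never)]
pub fn folds(x: &[u32]) -> u32 {
    x.iter().fold(0, |a, &b| a + b)
}

#[inline(never)]
pub fn folds_foo(x: &[Foo<u32>]) -> Foo<u32> {
    x.iter().fold(Foo(0), |a, &b| Foo(a.0 + b.0))
}

/// Hypothetical helper: calls `f` `iters` times and returns the total elapsed time.
pub fn time_it<R>(iters: u32, mut f: impl FnMut() -> R) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the result alive so the call is not optimized away.
        black_box(f());
    }
    start.elapsed()
}

/// Times both folds over 1024 elements; returns (plain u32, newtype) totals.
pub fn compare(iters: u32) -> (Duration, Duration) {
    let xs = black_box(vec![1u32; 1024]);
    let ys = black_box(vec![Foo(1u32); 1024]);
    let plain = time_it(iters, || folds(&xs));
    let newtype = time_it(iters, || folds_foo(&ys));
    (plain, newtype)
}
```

Calling compare(100_000) from main in a build with -C opt-level=3 and dividing each Duration by the iteration count gives a per-call figure to compare against the ns/iter numbers in this thread.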