Newtype impedes vectorization

Question

bluss opened this issue 9 years ago · 9 comments

In the following example, the newtype is not zero-cost in practice, since it seems to impede optimizations in LLVM. The plain u32 sum vectorizes while the Foo(u32) sum does not. The newtype fold takes about 3 times as long as the plain u32 fold.

rustc version: rustc 1.1.0-nightly (97d4e76 2015-04-27) (built 2015-04-28)

code (playpen link)

(The code has been updated)

#![crate_type="lib"]
#![feature(test)]
extern crate test;

#[inline(never)]
pub fn folds(x: &[u32]) -> u32 { x.iter().fold(0, |a, &b| a + b) }

#[derive(Copy, Clone)]
pub struct Foo<T>(T);

#[inline(never)]
pub fn folds_foo(x: &[Foo<u32>]) -> Foo<u32> { x.iter().fold(Foo(0), |a, &b| Foo(a.0 + b.0)) }

#[bench]
fn folds1(b: &mut test::Bencher)
{
    let xs = test::black_box(vec![1; 1024]);
    b.iter(|| {
        folds(&xs)
    })
}

#[bench]
fn folds2(b: &mut test::Bencher)
{
    let xs = test::black_box(vec![Foo(1); 1024]);
    b.iter(|| {
        folds_foo(&xs)
    })
}

Bench results vary with compilation settings:

// rustc -C opt-level=3 --test

running 2 tests
test folds1 ... bench:       206 ns/iter (+/- 5)
test folds2 ... bench:       609 ns/iter (+/- 6)

// rustc -C opt-level=3 -C target-cpu=corei7-avx --test
running 2 tests
test folds1 ... bench:       131 ns/iter (+/- 1)
test folds2 ... bench:       192 ns/iter (+/- 3)
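
The benchmark above needs a nightly compiler because of #![feature(test)]. As a rough sketch only, the same comparison can be approximated on stable Rust with std::time::Instant and std::hint::black_box. The helper time_ns and the iteration count are assumptions made for illustration and are not part of the original report; a hand-rolled timer like this is noisier than the built-in bencher, so the numbers are only indicative.

```rust
use std::hint::black_box;
use std::time::Instant;

#[derive(Copy, Clone)]
pub struct Foo<T>(T);

#[inline(never)]
pub fn folds(x: &[u32]) -> u32 {
    x.iter().fold(0, |a, &b| a + b)
}

#[inline(never)]
pub fn folds_foo(x: &[Foo<u32>]) -> Foo<u32> {
    x.iter().fold(Foo(0), |a, &b| Foo(a.0 + b.0))
}

/// Hypothetical helper (not from the issue): average nanoseconds per call
/// of `f` over `iters` runs.
pub fn time_ns<R>(iters: u32, mut f: impl FnMut() -> R) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from discarding the result.
        black_box(f());
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}

fn main() {
    let xs = black_box(vec![1u32; 1024]);
    let ys = black_box(vec![Foo(1u32); 1024]);

    // Both folds must compute the same sum.
    assert_eq!(folds(&xs), folds_foo(&ys).0);

    let plain = time_ns(100_000, || folds(black_box(&xs)));
    let newtype = time_ns(100_000, || folds_foo(black_box(&ys)));
    println!("plain u32: {plain:.1} ns/iter");
    println!("Foo<u32>:  {newtype:.1} ns/iter");
}
```

Build it with rustc -C opt-level=3 (optionally adding -C target-cpu=...) to mirror the settings used above.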

Answer 1 · 2015-04-29T22:56:18.000Z

Answer 2 · 2015-04-30T10:38:48.000Z

I agree, that's what I was intending @Stebalien, but it shouldn't affect codegen either. When I tried it, it didn't affect the benchmarks. Vectorization is cool btw: compiling with corei7-avx decreases the plain u32 fold's runtime even more, increasing the benchmark difference. :-)

Answer 3 · 2015-04-30T13:21:11.000Z

Updated code & bench.

Answer 4 · 2015-05-07T04:43:04.000Z

Ah goddammit, I actually wrote a patch that unwrapped newtypes, at least for simple cases (which includes this one), but abandoned it because I couldn't see a noticeable difference in optimisation/performance.
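
The patch mentioned here worked inside the compiler and is not shown in the thread. Purely to illustrate what "unwrapping" a newtype means, here is a source-level sketch of the same idea. It assumes #[repr(transparent)], which was added to the language after this thread, and a hypothetical helper unwrap_slice; neither appears in the original code.

```rust
use std::mem::{align_of, size_of};

// Assumption: repr(transparent) guarantees that Foo<T> has exactly the
// layout of its single field T. The original Foo has no repr attribute.
#[derive(Copy, Clone)]
#[repr(transparent)]
pub struct Foo<T>(T);

/// Hypothetical helper: view a slice of newtypes as a slice of the inner type.
pub fn unwrap_slice<T>(xs: &[Foo<T>]) -> &[T] {
    // SAFETY: Foo<T> is repr(transparent) over T, so element size, alignment,
    // and validity are identical; pointer and length come from a valid slice.
    unsafe { std::slice::from_raw_parts(xs.as_ptr().cast::<T>(), xs.len()) }
}

/// Same result as folds_foo from the issue, but the loop runs over plain u32.
pub fn folds_foo_unwrapped(x: &[Foo<u32>]) -> Foo<u32> {
    Foo(unwrap_slice(x).iter().fold(0, |a, &b| a + b))
}

fn main() {
    // The newtype adds no size or alignment overhead.
    assert_eq!(size_of::<Foo<u32>>(), size_of::<u32>());
    assert_eq!(align_of::<Foo<u32>>(), align_of::<u32>());

    let ys = vec![Foo(1u32); 1024];
    assert_eq!(folds_foo_unwrapped(&ys).0, 1024);
}
```

This is a manual workaround rather than a fix: the point of the issue is that the compiler should produce equally good code for the wrapped version without help.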

Answer 5 · 2015-05-07T09:50:10.000Z

Saying that it doesn't vectorize isn't actually correct. It does vectorize, just less efficiently. That's some puzzle to try to understand.

Answer 6 · 2015-05-25T15:19:52.000Z

cc @dotdash if you have time & interest

Answer 7 · 2016-03-25T17:04:51.000Z

Triage: Still an issue with rustc 1.9.0-nightly (98f0a9128 2016-03-23)

Answer 8 · 2017-05-02T12:14:17.000Z

I feel like this is fixed judging by the bench results below. Please reopen if that's not the case, preferably with a summary of what we should be looking for to close this issue.

Without target-cpu:

$ ./test --bench

running 2 tests
test folds1 ... bench:          44 ns/iter (+/- 8)
test folds2 ... bench:          45 ns/iter (+/- 2)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

and with:

$ rustc -Copt-level=3 -C target-cpu=corei7-avx --test test.rs
$ ./test --bench

running 2 tests
test folds1 ... bench:          40 ns/iter (+/- 0)
test folds2 ... bench:          40 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

Answer 9 · 2017-05-02T17:58:04.000Z

Nice! I can confirm that too.

According to playpen right now, it is not fixed in stable and not fixed in rustc 1.18.0-beta.1 (4dce67253 2017-04-25).

But it is fixed in rustc 1.19.0-nightly (777ee2079 2017-05-01)