rust-num/num-complex

Stacked vectorization instructions not used

aplund opened this issue · 5 comments

There are some obvious vectorisation opportunities that don't seem to be taken.

As an example, consider this trivial test case:

pub extern fn mcf67(a: Complex<f64>, b: Complex<f64>) -> Complex<f64> { a + b }

The optimised LLVM-IR for this function is:

; Function Attrs: norecurse nounwind nonlazybind readnone uwtable
define { double, double } @mcf67({ double, double }, { double, double }) unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception", %"unwind::libunwind::_Unwind_Context")* @rust_eh_personality {
start:
%.fca.0.extract = extractvalue { double, double } %0, 0
%.fca.1.extract = extractvalue { double, double } %0, 1
%.fca.0.extract6 = extractvalue { double, double } %1, 0
%.fca.1.extract7 = extractvalue { double, double } %1, 1
%2 = fadd double %.fca.0.extract, %.fca.0.extract6
%3 = fadd double %.fca.1.extract, %.fca.1.extract7
%4 = insertvalue { double, double } undef, double %2, 0
%5 = insertvalue { double, double } %4, double %3, 1
ret { double, double } %5
}

Which for my architecture (Skylake) gives optimised assembly:

mcf67: # @mcf67
.cfi_startproc
# BB#0: # %start
vaddsd %xmm2, %xmm0, %xmm0
vaddsd %xmm3, %xmm1, %xmm1
retq

Hand-crafting the LLVM-IR for the same operation using LLVM's vector type:

; Function Attrs: norecurse nounwind nonlazybind readnone uwtable
define < 2 x double > @mcf69(< 2 x double >, < 2 x double >) unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception", %"unwind::libunwind::_Unwind_Context")* @rust_eh_personality {
start:
%2 = fadd < 2 x double > %0, %1
ret < 2 x double > %2
}

This gives assembly using stacked instructions:

mcf69: # @mcf69
.cfi_startproc
# BB#0: # %start
vaddpd %xmm1, %xmm0, %xmm0
retq

The key seems to be that Rust lowers the Complex data structure to an LLVM struct instead of an LLVM vector. I'm not sure whether this is an LLVM issue, a Rust issue, or an issue with this library. Perhaps there is some Rust magic to tell the compiler to use stacked operations, but it would seem an unnecessary burden to hand-craft SIMD code in Rust when LLVM can sort these issues out itself when its vector constructs are used.

This example was generated using rustc 1.33.0.

FWIW, the equivalent C also uses addsd / vaddsd depending on target:

double _Complex add(double _Complex a, double _Complex b) { return a + b; }
0000000000000000 <add>:
   0:   c5 fb 58 c2             vaddsd %xmm2,%xmm0,%xmm0
   4:   c5 f3 58 cb             vaddsd %xmm3,%xmm1,%xmm1
   8:   c3                      retq

Our Complex<T> is defined with #[repr(C)] to give it the same memory layout, but we can't actually influence the calling ABI to be the same -- although I think it does happen to work for x86_64.

However, you can use explicit SIMD in Rust:

use std::arch::x86_64::{__m128d, _mm_add_pd};

#[repr(transparent)]
pub struct Complex(__m128d);

impl std::ops::Add for Complex {
    type Output = Self;

    // extern "Rust" passes the args and return value indirectly
    fn add(self, other: Self) -> Self {
        Self(unsafe { _mm_add_pd(self.0, other.0) })
    }
}

// extern "C" passes the args and return value directly in XMM
pub extern "C" fn complex_add(a: Complex, b: Complex) -> Complex {
    a + b
}

For Skylake this gives me:

0000000000000000 <_ZN58_$LT$issue47..Complex$u20$as$u20$core..ops..arith..Add$GT$3add17h21194b99a9c7989eE>:
   0:   c5 f9 28 06             vmovapd (%rsi),%xmm0
   4:   c5 f9 58 02             vaddpd (%rdx),%xmm0,%xmm0
   8:   c5 f9 29 07             vmovapd %xmm0,(%rdi)
   c:   48 89 f8                mov    %rdi,%rax
   f:   c3                      retq

0000000000000000 <_ZN7issue4711complex_add17he473582405d2b190E>:
   0:   c5 f9 58 c1             vaddpd %xmm1,%xmm0,%xmm0
   4:   c3                      retq

I was expecting LLVM to do the architecture-specific optimisations rather than having to force Rust to choose them, so perhaps this is more of an LLVM issue. But I was able to create LLVM code that did use stacked vector operations, which makes me unsure exactly where the issue actually lies.

But I was able to create LLVM code that did use stacked vector operations

Yes, but you had to change the function signature to do so. Maybe LLVM could optimize that for functions with hidden visibility, but certainly not for anything that may be linked elsewhere. I expect the more common case is this will just be inlined, and then auto-vectorization can do whatever it wants.

Anyway, I don't think there's anything we can do from here.

This is closed, but I dug more deeply into this. The AMD64 ABI requires the individual fields of a small struct to be passed in separate registers, which means the "re" and "im" parts of the Complex struct must be passed individually by definition.

If you rewrite the function to borrow the Complex arguments, the stacked vectorisation is used, and inlining gets rid of the memory traffic in most cases. But you must use this call signature: if you pass by value, the inlined function has the fields picked apart, and LLVM doesn't seem to be smart enough to put them back together under optimisation.

Related: [llvm-dev] RFC: Complex in LLVM

However, even if that goes through, it wouldn't be something we could access as a crate. Maybe LLVM would also be clever enough to recognize our type as matching that internal support, or perhaps it would be an impetus to add primitive complex types in the Rust language.