SIMDPirates.jl
is a library for SIMD
intrinsics. The code was stolen from SIMD.jl, whose authors and maintainers deserve credit for the good work here. Aside from pirating code, SIMDPirates
threatens to pirate methods. SIMD.jl
provides methods for a custom vector type defined as:
const VE = Base.VecElement
struct Vec{N,T<:ScalarTypes} <: DenseArray{T,1} # <: Number
elts::NTuple{N,VE{T}}
@inline Vec{N,T}(elts::NTuple{N, VE{T}}) where {N,T} = new{N,T}(elts)
end
This allows SIMD
to extend methods for functions such as +
, *
, or sqrt
to dispatch to the appropriate SIMD
intrinsic methods when called on the SIMD.Vec
type. However, SIMDPirates instead defines an alias to a base type:
const Vec{N,T} = NTuple{N,Core.VecElement{T}}
instead of wrapping the NTuple{N,Core.VecElement{T}}
in another struct
type. This means that SIMDPirates
cannot extend these base methods without committing type piracy. For this reason, base methods are not extended. Instead, the library provides unexported methods, generally prefixed with a v
. Instead of a + b
, we have SIMDPirates.vadd(a,b)
. In place of fma(a,b,c)
, it is SIMDPirates.vfma(a,b,c)
, etc.
Type piracy is extending methods of functions defined outside of the present library to types defined outside the library. This is not kosher, because then the act of importing a library suddenly changes global state; any other code that assumes different behavior between these functions and types now suddenly has bugs. However, the @pirate
macro allows you to locally pirate only the methods in the following expression, avoiding the dangers of piracy.
julia> using SIMDPirates
julia> C = randn(8,8);
julia> v1 = vload(Vec{8,Float64},C, 1);
julia> v2 = vload(Vec{8,Float64},C, 9);
julia> v3 = vload(Vec{8,Float64},C,17);
julia> v4 = vload(Vec{8,Float64},C,25);
julia> @macroexpand @pirate sqrt(v1)
:((SIMDPirates.vsqrt)(v1))
julia> @macroexpand @pirate v1 + v2
:((SIMDPirates.vadd)(v1, v2))
julia> @macroexpand @pirate v1 + v2 * v3
:((SIMDPirates.vmuladd)(v2, v3, v1))
julia> @macroexpand @pirate v1 * v2 + v3 * v4
:((SIMDPirates.vmuladd)(v1, v2, (SIMDPirates.vmul)(v3, v4)))
Why the trouble?
LLVM has an easier time optimizing code without the wrapper. Using SIMDPirates.jl
:
julia> @code_native SIMDPirates.vmuladd(v1, v2, v3)
.text
; Function vmuladd {
; Location: floating_point_arithmetic.jl:48
; Function llvmwrap; {
; Location: llvmwrap.jl:148
; Function llvmwrap; {
; Location: llvmwrap.jl:148
; Function macro expansion; {
; Location: floating_point_arithmetic.jl:48
vfmadd213pd %zmm2, %zmm1, %zmm0
;}}}
retq
nopw (%rax,%rax)
;}
while SIMD.jl
results in plenty of unnecessary mov instructions:
julia> @code_native muladd(v1, v2, v3)
.text
; Function muladd {
; Location: SIMD.jl:1007
; Function llvmwrap; {
; Location: SIMD.jl:663
; Function llvmwrap; {
; Location: SIMD.jl:663
; Function macro expansion; {
; Location: SIMD.jl:1007
vmovupd (%rsi), %zmm0
vmovupd (%rdx), %zmm1
vfmadd213pd (%rcx), %zmm0, %zmm1
;}}}
vmovapd %zmm1, (%rdi)
movq %rdi, %rax
vzeroupper
retq
nop
;}
Compared to SIMDPirates.jl
, SIMD.jl
resulted in lots of unnecessary instructions; in particular: two vmovupd
, one vmovapd
, and one movq
.
The compiler can often eliminate many of these instructions, but rarely all
Additionally, SIMDPirates.jl
works better with SLEEFwrap.jl
for high performance vectorized elementary functions. SIMD.jl
results in several extra move instructions per call, and in the case of avx-512, has been highly unstable with frequent segmentation faults (although SLEEFwrap.sincos
still segfaults in benchmarks).