JuliaLinearAlgebra/RecursiveFactorization.jl

Unexpected timings lu vs precision


Hi, I'm playing with RecursiveFactorization and am finding that single-precision factorizations are much faster than the cubic flop count would predict; Float16 is slower than the prediction as well. I was hoping that Apple silicon and Julia's hardware Float16 support would make Float16 show better here. Even so, it is still far faster than an OpenBLAS LU in Float16.

I'd think that time(double) = 2 x time(single) = 4 x time(half)

would be the case for larger problems, but I am not seeing that. Is there an algorithmic reason for this? Should I be setting parameters in the factorization to something other than their defaults?

I ran into this while using this package for my own work; it has served me well, so this post is a happy story, not a complaint.

If you have insight or advice, I'd welcome it. The results below come from an M2 Pro Mac and Julia 1.10-beta1.

I ran the script below to factor a random N × N matrix at three precisions and got this:

julia> speed_test(2048)
Time for Float64 = 3.22886e-01
Time for Float32 = 1.70550e-02
Time for Float16 = 5.77817e-02

julia> speed_test(4096)
Time for Float64 = 3.68761e+00
Time for Float32 = 2.47510e-01
Time for Float16 = 4.73640e-01

julia> speed_test(8192)
Time for Float64 = 3.87592e+01
Time for Float32 = 3.12564e+00
Time for Float16 = 6.68709e+00
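
For what it's worth, converting those times to GFLOP/s makes the gap easier to see. A small helper I use for that (my own, assuming the standard 2N³/3 flop count for LU):

lu_gflops(N, t) = 2 * N^3 / (3 * t * 1e9)   # rough GFLOP/s for an N×N LU taking t seconds

For the N = 8192 run this gives roughly 9.5 GFLOP/s in Float64 versus about 117 in Float32: a 12x gap where I'd expect 2x.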

using Random
using BenchmarkTools
using RecursiveFactorization
using LinearAlgebra
using Printf

function speed_test(N = 2048)
    Random.seed!(46071)
    A = rand(N, N)
    for T in [Float64, Float32, Float16]
        AT = T.(A)                    # broadcasting already allocates a new array
        tlu = @belapsed rlu($AT)      # minimum time over BenchmarkTools' samples
        @printf("Time for %s = %.5e\n", T, tlu)
    end
end

function rlu(C)
    B = copy(C)                       # factor a fresh copy each call; lu! overwrites B
    CF = RecursiveFactorization.lu!(B, Vector{Int}(undef, size(B, 2)))
    return CF
end
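
One caveat with rlu as written: the copy is timed together with the factorization. To exclude it, BenchmarkTools' setup and evals keywords can rebuild the input before each sample (standard BenchmarkTools machinery; rlu_time is my own name):

function rlu_time(AT)
    # setup rebuilds B before every sample, and evals=1 ensures each evaluation
    # factors a fresh, unfactored matrix; the copy is excluded from the timing.
    @belapsed RecursiveFactorization.lu!(B, Vector{Int}(undef, size(B, 2))) setup=(B = copy($AT)) evals=1
end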

Float64's bad performance is a Julia 1.10 bug.
On my Intel CPU (10980XE), Julia 1.9:

julia> speed_test(2048)
Time for Float64 = 0.037588217
Time for Float32 = 0.01992773
Time for Float16 = 1.875628065

julia> speed_test(4096)
Time for Float64 = 0.309555341
Time for Float32 = 0.11584808
Time for Float16 = 14.296447513

julia> speed_test(8192)
Time for Float64 = 2.595502994
Time for Float32 = 1.237905769
Time for Float16 = 113.31325847

Julia master:

julia> speed_test(2048)
Time for Float64 = 0.183237959
Time for Float32 = 0.019942183
Time for Float16 = 0.278549435

julia> speed_test(4096)
Time for Float64 = 0.864374881
Time for Float32 = 0.104797648
Time for Float16 = 1.54903296

julia> speed_test(8192)
Time for Float64 = 6.382979536
Time for Float32 = 1.499058728
Time for Float16 = 14.409940574

I am now rebuilding to see if this was fixed by JuliaLang/julia@bea8c44.

The Float64 regression does seem to be fixed on the latest master, presumably by the commit I referenced above:

julia> speed_test(2048)
Time for Float64 = 0.035810193
Time for Float32 = 0.016416317
Time for Float16 = 0.24953308

julia> speed_test(4096)
Time for Float64 = 0.297429925
Time for Float32 = 0.131990487
Time for Float16 = 1.526133223

julia> speed_test(8192)
Time for Float64 = 3.204072447
Time for Float32 = 1.394537522
Time for Float16 = 13.691441064

As for why Float16 is so slow: Float16 isn't supported by TriangularSolve.jl.
I'm not sure how much of the difference this accounts for, but you're welcome to make a PR.

A secondary issue is that LoopVectorization likely treats Float16 as if it were the same size as Float32, so both get the same vector width:

julia> using VectorizationBase

julia> VectorizationBase.pick_vector_width(Float16)
static(16)

julia> VectorizationBase.pick_vector_width(Float32)
static(16)

julia> VectorizationBase.pick_vector_width(Float64)
static(8)

On your M2 you'd probably get 4, 4, and 2, when it should be 8, 4, and 2, since the M2 supports Float16 natively (on my CPU, LV handles Float16 by promoting to Float32 and then truncating again).
You're also welcome to make PRs addressing that.
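
For context, those widths are just the SIMD register size in bytes divided by the element size, with Float16 inheriting Float32's width because of the promotion. On this AVX-512 machine (register_size is an existing VectorizationBase function; the divisions are just illustration):

julia> VectorizationBase.register_size()   # bytes per SIMD register
static(64)

julia> (64 ÷ sizeof(Float32), 64 ÷ sizeof(Float64))   # matches static(16) and static(8) above
(16, 8)

julia> 64 ÷ sizeof(Float16)   # the width a native-Float16 path would get
32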

Is TriangularSolve.jl used for the solves after the factorization, or within it somewhere? The triangular solves after the factorization have identical timings at each precision as far as I can tell. I will dev TriangularSolve and see whether adding Float16 to the places where I see T <: Union{Float32, Float64} makes any difference (a sketch of what I mean is below). If it does, I will make a PR.
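
Concretely, the kind of edit I mean is widening the element-type unions in the method signatures (a hypothetical sketch with my own names, not an actual diff from the package):

const BlasFloats = Union{Float32, Float64}           # what the methods accept today
const WithHalf   = Union{Float16, Float32, Float64}  # what I'm widening them to

# so a signature like
#     f(A::AbstractMatrix{T}, ...) where {T <: BlasFloats}
# becomes
#     f(A::AbstractMatrix{T}, ...) where {T <: WithHalf}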

I looked at LoopVectorization, but don't understand what "vector_width" means or how to change it. This one is far above my pay grade.

Is TriangularSolve.jl used for the solves after the factorization, or within it somewhere?

Within.

I looked at LoopVectorization, but don't understand what "vector_width" means or how to change it. This one is far above my pay grade.

Vector width is the width of the SIMD vectors it uses.
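
For example, on your M2 the NEON registers are 128 bits (16 bytes), which is where the 8, 4, and 2 above come from:

julia> [T => 16 ÷ sizeof(T) for T in (Float16, Float32, Float64)]
3-element Vector{Pair{DataType, Int64}}:
 Float16 => 8
 Float32 => 4
 Float64 => 2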

I made the change to TriangularSolve.jl and got this:

ERROR: MethodError: no method matching VectorizationBase.VecUnroll(::Tuple{VectorizationBase.Vec{4, Float16}, VectorizationBase.Vec{4, Float32}, VectorizationBase.Vec{4, Float32}, VectorizationBase.Vec{4, Float32}})

I am in over my head. Is there something simple I can do to fix this? If there's a line or two in a file that I can change, I will do it if you tell me which lines and files.

I am in over my head. Is there something simple I can do to fix this? If there's a line or two in a file that I can change, I will do it if you tell me which lines and files.

I don't know. The vast majority of the work is knowing the lines, not making the changes.

In this case, you could look at the stack trace and try to see why

ERROR: MethodError: no method matching VectorizationBase.VecUnroll(::Tuple{
  VectorizationBase.Vec{4, Float16},
  VectorizationBase.Vec{4, Float32},
  VectorizationBase.Vec{4, Float32},
  VectorizationBase.Vec{4, Float32}
})

we have only one Float16 while the rest are Float32. VecUnroll requires all of its sub-vectors to share a single element type, which suggests one of the four sub-vectors missed the Float16-to-Float32 promotion that the other three went through.

Upon further review ...
Everything works for small problems and I get the failure for larger ones.
Example below.

This is on 1.9.2.

I looked at the lines in TriangularSolve.jl in the stacktrace and could not figure out what was happening. Nor can I figure out why small problems work.

I see that Float16 gets promoted in

VectorizationBase.jl/src/promotion.jl

but do not understand what is happening there.
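
If I read it right, the promotion means: widen Float16 to Float32 on load, do the arithmetic in Float32, and truncate back to Float16 on store. A plain-Julia sketch of that idea (my code, not what VectorizationBase actually does internally):

# Scalar version of what the promoted SIMD path does lane by lane:
# compute in Float32, store back as Float16.
function scale_add!(y::Vector{Float16}, a::Float16, x::Vector{Float16})
    a32 = Float32(a)
    @inbounds for i in eachindex(x, y)
        y[i] = Float16(a32 * Float32(x[i]) + Float32(y[i]))
    end
    return y
end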


julia> A=rand(Float16,5,5);

julia> AF=RecursiveFactorization.lu!(A);

julia> B=rand(Float16,512,512);

julia> BF=RecursiveFactorization.lu!(B);

ERROR: MethodError: no method matching VectorizationBase.VecUnroll(::Tuple{VectorizationBase.Vec{4, Float16}, VectorizationBase.Vec{4, Float32}, VectorizationBase.Vec{4, Float32}, VectorizationBase.Vec{4, Float32}})

Closest candidates are:
(::Type{T})(::AbstractChar) where T<:Union{AbstractChar, Number}
@ Base char.jl:50
(::Type{T})(::Base.TwicePrecision) where T<:Number
@ Base twiceprecision.jl:266
(::Type{T})(::Complex) where T<:Real
@ Base complex.jl:44
...

Stacktrace:
[1] macro expansion
@ ~/Dropbox/Julia/dotdev/local/TriangularSolve/src/TriangularSolve.jl:45 [inlined]
[2] solve_AU(A::VectorizationBase.VecUnroll{3, 4, Float16, VectorizationBase.Vec{4, Float16}}, spu::LayoutPointers.StridedPointer{Float16, 2, 2, 0, (2, 1), Tuple{Int64, Static.StaticInt{2}}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, noff::Int64, #unused#::Val{true})
@ TriangularSolve ~/Dropbox/Julia/dotdev/local/TriangularSolve/src/TriangularSolve.jl:22
[3] macro expansion
@ ~/Dropbox/Julia/dotdev/local/TriangularSolve/src/TriangularSolve.jl:208 [inlined]
[4] rdiv_solve_W!
@ ~/Dropbox/Julia/dotdev/local/TriangularSolve/src/TriangularSolve.jl:180 [inlined]
[5] rdiv_U!(spc::LayoutPointers.StridedPointer{Float16, 2, 2, 0, (2, 1), Tuple{Int64, Static.StaticInt{2}}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, spa::LayoutPointers.StridedPointer{Float16, 2, 2, 0, (2, 1), Tuple{Int64, Static.StaticInt{2}}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, spu::LayoutPointers.StridedPointer{Float16, 2, 2, 0, (2, 1), Tuple{Int64, Static.StaticInt{2}}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}}}, M::Int64, N::Int64, #unused#::Static.StaticInt{2}, #unused#::Val{true})
@ TriangularSolve ~/Dropbox/Julia/dotdev/local/TriangularSolve/src/TriangularSolve.jl:724
[6] div_dispatch!(C::Transpose{Float16, SubArray{Float16, 2, Matrix{Float16}, Tuple{UnitRange{Int64}, UnitRange{Int64}}, false}}, A::Transpose{Float16, SubArray{Float16, 2, Matrix{Float16}, Tuple{UnitRange{Int64}, UnitRange{Int64}}, false}}, U::Transpose{Float16, SubArray{Float16, 2, Matrix{Float16}, Tuple{UnitRange{Int64}, UnitRange{Int64}}, false}}, nthread::Static.StaticInt{1}, #unused#::Val{true})
@ TriangularSolve ~/Dropbox/Julia/dotdev/local/TriangularSolve/src/TriangularSolve.jl:320
[7] ldiv!
@ ~/Dropbox/Julia/dotdev/local/TriangularSolve/src/TriangularSolve.jl:479 [inlined]
[8] reckernel!(A::SubArray{Float16, 2, Matrix{Float16}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}, pivot::Val{true}, m::Int64, n::Int64, ipiv::SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}, info::Int64, blocksize::Int64, thread::Val{false})
@ RecursiveFactorization ~/.julia/packages/RecursiveFactorization/MHCIz/src/lu.jl:214
[9] reckernel!(A::SubArray{Float16, 2, Matrix{Float16}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}, pivot::Val{true}, m::Int64, n::Int64, ipiv::SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}, info::Int64, blocksize::Int64, thread::Val{false}) (repeats 4 times)
@ RecursiveFactorization ~/.julia/packages/RecursiveFactorization/MHCIz/src/lu.jl:208
[10] reckernel!(A::Matrix{Float16}, pivot::Val{true}, m::Int64, n::Int64, ipiv::Vector{Int64}, info::Int64, blocksize::Int64, thread::Val{false})
@ RecursiveFactorization ~/.julia/packages/RecursiveFactorization/MHCIz/src/lu.jl:208
[11] _recurse!
@ ~/.julia/packages/RecursiveFactorization/MHCIz/src/lu.jl:126 [inlined]
[12] recurse!
@ ~/.julia/packages/RecursiveFactorization/MHCIz/src/lu.jl:117 [inlined]
[13] lu!(A::Matrix{Float16}, ipiv::Vector{Int64}, pivot::Val{true}, thread::Val{true}; check::Bool, blocksize::Int64, threshold::Int64)
@ RecursiveFactorization ~/.julia/packages/RecursiveFactorization/MHCIz/src/lu.jl:102
[14] lu!
@ ~/.julia/packages/RecursiveFactorization/MHCIz/src/lu.jl:82 [inlined]
[15] #lu!#3
@ ~/.julia/packages/RecursiveFactorization/MHCIz/src/lu.jl:64 [inlined]
[16] lu!
@ ~/.julia/packages/RecursiveFactorization/MHCIz/src/lu.jl:55 [inlined]
[17] lu!(A::Matrix{Float16})
@ RecursiveFactorization ~/.julia/packages/RecursiveFactorization/MHCIz/src/lu.jl:55
[18] top-level scope
@ REPL[8]:1
