FMA thoughts
neurolabusc opened this issue · 1 comments
Thanks for the well documented and elegantly coded repository. The documentation correctly notes that FMA is a single operation for ARM, but not for SSE.
One option would be to use AVX FMA3 instructions provided with modern AMD/Intel CPUs. The FreePascal code below illustrates this. I can understand why you might not want to do this - SSE is available with every x86_64 CPU, while AVX requires a more recent computer (and therefore one would want to detect this feature).
Any plans to extend this repository to combine other AVX instructions? On one hand, the 128-bit SSE is perfect for your 4-component singles, but I would have thought the 256-bits would allow you to tackle double precision values.
program avxTst;
{$mode delphi}
uses sysutils;
//https://stackoverflow.com/questions/41468871/how-to-detect-avx2-in-delphi
{$ASMMODE INTEL}
function IsAVX2supported: boolean;
asm
// Save EBX
{$IFDEF CPUx86}
push ebx
{$ELSE CPUx64}
mov r10, rbx
{$ENDIF}
//Check CPUID.0
xor eax, eax
cpuid //modifies EAX,EBX,ECX,EDX
cmp al, 7 // do we have a CPUID leaf 7 ?
jge @Leaf7
xor eax, eax
jmp @Exit
@Leaf7:
//Check CPUID.7
mov eax, 7h
xor ecx, ecx
cpuid
bt ebx, 5 //AVX2: CPUID.(EAX=07H, ECX=0H):EBX.AVX2[bit 5]=1
setc al
@Exit:
// Restore EBX
{$IFDEF CPUx86}
pop ebx
{$ELSE CPUx64}
mov rbx, r10
{$ENDIF}
end;
type
TVec4 = packed record
x,y,z,w: single;
end;
function v(x,y,z,w:single): TVec4;
begin
result.x := x;
result.y := y;
result.z := z;
result.w := w;
end;
{$ASMMODE INTEL}
function FMA(const A, B, C: TVec4): TVec4;
//result := (A * B) + C
//https://neslib.github.io/FastMath/Neslib.FastMath.html#FMA
//https://en.wikipedia.org/wiki/FMA_instruction_set#Excerpt_from_FMA3
begin
asm
movups xmm0, [C]
movups xmm1, [B]
movups xmm2, [A]
vfmadd231ps xmm0, xmm1, xmm2 //a = b·c + a
movhlps xmm1, xmm0
vzeroupper
end;
end;
procedure tst;
var
a,b,c,d: TVec4;
begin
a := v(1,2,3,4);
b := v(5,10,10,10);
c := v(22,1,1,1);
d := FMA(a,b,c);
writeln(format('xyzw: %g %g %g %g', [d.x, d.y, d.z, d.w]));
end;
begin
if not IsAVX2supported then begin
writeln('AVX2 is not supported');
exit;
end;
tst;
end.
Glad you like the repo.
I currently don't have plans to support AVX or later instruction sets, or to support double-precision math. Some functionality in FastMath may benefit from AVX instructions, but in general, the small gain in speed is not worth the extra effort IMHO.
It's a lot of work designing and writing assembly routines for 4 different architectures alone (x86, x64, NEON and ARM64). Supporting newer instruction sets makes that even harder. Also, I settled on an instruction set (SSE2) that virtually all Intel/AMD-based computers support nowadays. Like you said, we would have to perform CPU capability checks to add support for AVX and other instruction sets. That would also mean we would have to dynamically dispatch routines to different versions based on CPU caps. Since most routines are pretty small, we have to take the additional overhead of this dynamic dispatch into account. Or maybe use conditional defines to force an instruction set (but then the app will crash on computers that don't support it).
I may add that in the future to support wider registers and double-precision floating-point math.
In the meantime, I may add AVX support for specific use cases, like the FMA example you provided. I will look into this when I have some time...