core.simd and gl3n
Opened this issue · 14 comments
Greetings.
I wondered if it would be useful to use https://dlang.org/spec/simd.html in gl3n. This could come in handy if you use gl3n a lot in collision detection or something similar.
Would that make sense to do?
Yes, that would make a lot of sense. Back when I wrote gl3n and SIMD came up, I was waiting on std.simd, but that never happened ...
I made a fork and will try to see if I can implement that somehow. Don't count on me though, I don't have too much time :/
So I think a problem will appear if SIMD is simply used to replace the non-SIMD math: some things might actually become slower.
performance, hasSimd = true...
Lots of ops took: 0.10047s
vs
performance, hasSimd = false...
Lots of ops took: 0.0750042s
The measured op was
for (int i = 0; i < 1_000_000; i++) {
    vec4 a = 43223.0;
    vec4 b = 1234.0;
    a += b;
}
That slowdown becomes even worse if I use automatic vectorization via array operations:
vector[] += r.vector[]
takes 0.15s, so 3 times as much time as when using float4 (in that case).
Now I changed the code a bit:
vec4 a = 43223.0;
vec4 b = 1234.0;
for (int i = 0; i < 1_000_000; i++) {
    a += b;
}
That results in the expected (albeit tiny) speedup.
performance, hasSimd = true...
Lots of ops took: 0.0097749s
vs
performance, hasSimd = false...
Lots of ops took: 0.0159956s
These differences exist because the vectors first need to be loaded into the SIMD registers. Repeated operations on the same set of vectors therefore speed up a lot, while general use slows down a lot.
So I think that implementing this would require a separate set of functions that utilize it, because otherwise it would just be a slowdown.
Having a separate struct for Vector/Matrix/Quaternion could make sense, depending on how different is from the code right now, otherwise just a flag passed to the constructor of the structs, making it possible to have both versions.
Code which wants to accept both versions of vectors needs to use something like foo(T)(T vec) if (some_vector!T), that's why I want compile-time interfaces ...
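As a rough sketch (not gl3n's actual API; the struct, the useSimd flag and the some_vector name are made up here), such a compile-time check could look like this:

```d
import std.traits : isInstanceOf;

// Hypothetical vector struct with a compile-time SIMD switch.
struct Vector(T, int dim, bool useSimd = false)
{
    T[dim] data;
}

// Compile-time "interface": accepts both variants of the template.
enum some_vector(T) = isInstanceOf!(Vector, T);

// Works with SIMD and non-SIMD vectors alike.
void foo(T)(T vec) if (some_vector!T)
{
    // ... operate on vec.data ...
}

unittest
{
    foo(Vector!(float, 4, false)()); // plain version
    foo(Vector!(float, 4, true)());  // SIMD version
}
```

The template constraint stands in for the missing compile-time interfaces: any instantiation of the one Vector template passes, so downstream code doesn't need to care which backend it got.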
> Having a separate struct for Vector/Matrix/Quaternion could make sense, depending on how different is from the code right now, otherwise just a flag passed to the constructor of the structs, making it possible to have both versions.
I tried to integrate it into the normal vector classes via a template argument, and in itself that works fine. However, I am not able to notice any speedup at all (tbh I only implemented the basic operations and tested them), and I suspect the implementation of both the core.simd.Vector types and the __simd magic causes a lot of copying around of data, which is kind of logical because the instructions run on the xmm/ymm registers.
Vector!(float, 4, true) a, b, c;
a = b = c = 1234.23234;
for (long i = 0; i < 1_000_000; i++) {
    a += b;
    a += c;
    a += a;
}
This should, in my understanding, run faster if a, b, c use core.simd.Vector!(float[4]) - however, it always ran slower than I expected.
It would be nice to work with data within these registers like you can with the Intel/C++ compiler intrinsics (the _mm_add_ss-like functions that take __m128 and __m256 types). So I'd go so far as to separate the normal vector/matrix types from the SIMD acceleration completely. Then you would do something like
vec4 a = vec4(123, 434, 124, 123);
vec4 b = vec4(434, 342, 323, 434);
simdVec!vec4 areg = a;
simdVec!vec4 breg = b;
for (int i = 0; i < 1_000_000; i++) {
    areg += breg; // ADDPS
    breg += areg; // ADDPS
}
float magnitude = areg.magnitude; // can be done with DPPS and SQRTSS
a = areg.toVec();
b = breg.toVec();
The main difference would be that the SIMD type (which ideally maps directly onto an xmm register) allows no direct access to the memory, to avoid any copying.
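An untested sketch of what such a wrapper could look like (the names SimdVec4 and toArray are made up for illustration): the data lives in a core.simd float4 and is only converted to a plain array at the boundaries.

```d
import core.simd;
import std.math : sqrt;

// Hypothetical wrapper: keeps the data in a float4 and never exposes
// element-wise memory access, so the compiler can keep it in a register.
struct SimdVec4
{
    private float4 reg;

    this(const float[4] v)
    {
        float4 tmp = [v[0], v[1], v[2], v[3]];
        reg = tmp;
    }

    // areg += breg; should compile down to a single ADDPS.
    void opOpAssign(string op : "+")(SimdVec4 rhs)
    {
        reg += rhs.reg;
    }

    // |v| via MULPS plus a horizontal sum and a scalar sqrt (SQRTSS);
    // DPPS could fuse the dot product on SSE4.1.
    float magnitude() const
    {
        float4 sq = reg * reg;
        const s = sq.array;
        return sqrt(s[0] + s[1] + s[2] + s[3]);
    }

    // Only crossing point back to ordinary memory.
    float[4] toArray() const
    {
        return reg.array;
    }
}
```

The design choice mirrors the `__m128` style: conversions to and from memory are explicit, so the cost of loads/stores shows up in the API instead of hiding inside every operator.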
Also, I'm kind of missing AVX support (the 256-bit ymm registers, double[4] stuff) in core.simd.__simd. Hmm.
Am I thinking right or am I blubbering complete bullshit? O.o
EDIT: I might have found the reason: this code:
import core.simd;

void doStuff()
{
    float4 x = [1.0, 0.4, 1234.0, 124.0];
    float4 y = [1.0, 0.4, 1234.0, 124.0];
    float4 z = [1.0, 0.4, 1234.0, 123.0];
    for (long i = 0; i < 1_000_000; i++) {
        x += y;
        x += z;
        z += x;
    }
}
Can be split in two parts. The first one is the assignment:
movaps xmm0,XMMWORD PTR [rip+0x0] # f <void example.doStuff()+0xf>
movaps XMMWORD PTR [rbp-0x40],xmm0
movaps xmm1,XMMWORD PTR [rip+0x0] # 1a <void example.doStuff()+0x1a>
movaps XMMWORD PTR [rbp-0x30],xmm1
movaps xmm2,XMMWORD PTR [rip+0x0] # 25 <void example.doStuff()+0x25>
movaps XMMWORD PTR [rbp-0x20],xmm2
Well, OK, it also copies the values onto the stack? Meh. Now the math in the loop:
movaps xmm3,XMMWORD PTR [rbp-0x30]
movaps xmm4,XMMWORD PTR [rbp-0x40]
addps xmm4,xmm3
movaps XMMWORD PTR [rbp-0x40],xmm4
movaps xmm0,XMMWORD PTR [rbp-0x20]
movaps xmm1,XMMWORD PTR [rbp-0x40]
addps xmm1,xmm0
movaps XMMWORD PTR [rbp-0x40],xmm1
movaps xmm2,XMMWORD PTR [rbp-0x40]
movaps xmm3,XMMWORD PTR [rbp-0x20]
addps xmm3,xmm2
movaps XMMWORD PTR [rbp-0x20],xmm3
OUCH! This should simply be
addps xmm0,xmm1
addps xmm0,xmm2
addps xmm2,xmm0
I guess I should report that as a compiler bug? https://issues.dlang.org/show_bug.cgi?id=16605
Thanks for looking into all of this.
I can't really help you here since my knowledge of SSE/SIMD instructions is very limited. You might want to ask in #D on freenode; there are some very smart people with compiler insight who can probably help you in a timely manner.
No problem, I enjoy this kind of stuff :)
I'm gonna head there, because I'm still not sure if my knowledge about SSE/SIMD is enough to come to the right conclusions. Let's see where this is headed!
It was me who was the fool! "-release" != "-O -release -boundscheck=off"
Now that looks like something!
Running ./gl3nspeed
Doing tests with SIMD=false and LC=10000000
Speed of the += operator on float (vec4+=vec4)
took: 0.140215s!
Doing tests with SIMD=true and LC=10000000
Speed of the += operator on float (vec4+=vec4)
took: 0.050686s!
That's almost 3 times faster!
> That's almost 3 times faster!
It gets better!
Enter loop count
10000000
Doing tests with SIMD=false and LC=10000000
Speed of the += operator on float (vec4+=vec4)
took: 0.139876s!
Speed of the magnitude operation on float |vec4|
took: 5.30766s!
Doing tests with SIMD=true and LC=10000000
Speed of the += operator on float (vec4+=vec4)
took: 0.0507099s!
Speed of the magnitude operation on float |vec4|
took: 1.02721s!
I'm gonna clean this up a bit and push it to my fork so you can take a look at it - if it fits the guidelines / the way you want stuff to be done for gl3n.
Here are my changes so far: master...mkalte666:master
I know that this is missing tests etc. I'll write those as soon as I can, I guess.
The speed test tool i used is https://github.com/mkalte666/gl3nspeed
You have to compile gl3n/gl3nspeed with DFLAGS="-release -O -boundscheck=off" dub. Or tell me how I can get dub to use -O xD
Looks good, minor style things but in general I like how it is done!
You gonna look into matrices as well?
> Looks good, minor style things but in general I like how it is done!

Thanks, I'm trying ^^
> You gonna look into matrices as well?
If I find the time. I'm not sure how well that can be done and what instructions already exist that could help out. Also, I still want to look into #68, and I guess that could be combined.
Thinking about speed (and not about time management on my side), this would be a massive improvement, however: "4x4 matrix multiplication is 64 multiplications and 48 additions. Using SSE this can be reduced to 16 multiplications and 12 additions (and 16 broadcasts)" http://stackoverflow.com/questions/18499971/efficient-4x4-matrix-multiplication-c-vs-assembly
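To illustrate the idea from that Stack Overflow answer, here's an untested sketch with core.simd (the mul4x4 name and the row-major float4[4] layout are assumptions, and the broadcast here goes through an array literal rather than SHUFPS):

```d
import core.simd;

// Sketch: c = a * b for row-major 4x4 matrices stored as four float4 rows.
// Result row i is the sum over k of broadcast(a[i][k]) * b[k], which is
// 16 broadcasts, 16 MULPS and 12 ADDPS instead of 64 scalar multiplies.
void mul4x4(ref float4[4] c, const float4[4] a, const float4[4] b)
{
    foreach (i; 0 .. 4)
    {
        float s = a[i].array[0];
        float4 acc = [s, s, s, s]; // broadcast a[i][0] into all lanes
        acc *= b[0];               // 1 MULPS
        foreach (k; 1 .. 4)
        {
            s = a[i].array[k];
            float4 t = [s, s, s, s];
            acc += t * b[k];       // MULPS + ADDPS
        }
        c[i] = acc;
    }
}
```

Per row that is 4 multiplies and 3 adds on whole vectors, matching the 16/12 count from the quote.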
One thing I wonder is whether operations with scalars (vec3 * float etc.) should be vectorized at all. While the operation itself would speed up, as long as the numerical value is not constant the resulting code would almost always be slower, because the scalar would have to be broadcast into a vector beforehand.
The fast way of doing such a multiplication (or any operation) would be to hold a (const?) vector somewhere and then do the operations. So doing
Vector!(float,4,true) scalar = 4.0;
Vector!(float,4,true) foo = 1234.01234;
Vector!(float,4,true) bar = 13.2434;
foo *= scalar;
bar *= scalar;
// ..... probably do this many times
would almost always result in faster code than if one would do
foo *= 4.0;
bar *= 4.0;
because the operator doesn't know whether it operates on a constant value or a variable. If there is a way to separate them (detecting whether a value is known at compile time), it could be done though - I don't know how.
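One possible way to separate the two cases, as an untested sketch (the scale name and overloads are made up): pass a compile-time-known scalar as a template value parameter, so the broadcast vector is a constant the compiler can fold, and keep a runtime overload for everything else.

```d
import core.simd;

// Scalar known at compile time: 's' is a template value parameter,
// so the splat vector is a constant and costs nothing per call.
float4 scale(float s)(float4 v)
{
    float4 splat = [s, s, s, s];
    return v * splat;
}

// Runtime scalar: the broadcast has to happen on every call.
float4 scale(float4 v, float s)
{
    float4 splat = [s, s, s, s];
    return v * splat;
}

// usage:
// auto a = scale!(4.0f)(v); // compile-time constant scalar
// auto b = scale(v, x);     // runtime scalar
```

The caller then chooses explicitly, which sidesteps the operator's inability to tell constants from variables.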