core.simd and gl3n
Opened this issue · 14 comments
Greetings.
I wondered if it would be useful to use https://dlang.org/spec/simd.html in gl3n. This could come in handy if you use gl3n a lot in collision detection or something similar.
Would that make sense to do?
Yes, that would make a lot of sense. Back when I wrote gl3n and SIMD came up, I was waiting on std.simd, but that never happened ...
I made a fork and will try to see if I can implement that somehow. Don't count on me though, I don't have too much time :/
So I think a problem will appear if SIMD is simply used to replace the non-SIMD math: some things might actually become slower.
performance, hasSimd = true...
Lots of ops took: 0.10047s
vs
performance, hasSimd = false...
Lots of ops took: 0.0750042s
The measured op was
for (int i = 0; i < 1_000_000; i++) {
    vec4 a = 43223.0;
    vec4 b = 1234.0;
    a += b;
}
That slowdown becomes even worse if I use automatic vectorization via array operations:
vector[] += r.vector[]
takes 0.15s, so 3 times as much time as when using float4 (in that case).
Now I changed the code a bit:
vec4 a = 43223.0;
vec4 b = 1234.0;
for (int i = 0; i < 1_000_000; i++) {
    a += b;
}
That results in the expected (albeit tiny) speedup.
performance, hasSimd = true...
Lots of ops took: 0.0097749s
vs
performance, hasSimd = false...
Lots of ops took: 0.0159956s
These differences exist because the vectors first need to be loaded into the SIMD registers. Repeated operations on the same set of vectors therefore speed up a lot, while general use slows down a lot.
So I think that implementing this would require a separate set of functions that utilize it, because otherwise it would just be a slowdown.
Having a separate struct for Vector/Matrix/Quaternion could make sense, depending on how different is from the code right now, otherwise just a flag passed to the constructor of the structs, making it possible to have both versions.
Code which wants to accept both versions of vectors needs to use something like foo(T)(T vec) if (some_vector!T), that's why I want compile-time interfaces ...
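As a rough sketch (not gl3n's actual API; the struct, the useSimd flag and the some_vector name are made up here), such a compile-time check could look like this:

```d
import std.traits : isInstanceOf;

// Hypothetical vector struct with a compile-time SIMD switch.
struct Vector(T, int dim, bool useSimd = false)
{
    T[dim] data;
}

// Compile-time "interface": accepts both variants of the template.
enum some_vector(T) = isInstanceOf!(Vector, T);

// Works with SIMD and non-SIMD vectors alike.
void foo(T)(T vec) if (some_vector!T)
{
    // ... operate on vec.data ...
}

unittest
{
    foo(Vector!(float, 4, false)()); // plain version
    foo(Vector!(float, 4, true)());  // SIMD version
}
```

The template constraint stands in for the missing compile-time interfaces: any instantiation of the one Vector template passes, so downstream code doesn't need to care which backend it got.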
> Having a separate struct for Vector/Matrix/Quaternion could make sense, depending on how different is from the code right now, otherwise just a flag passed to the constructor of the structs, making it possible to have both versions.
I tried to integrate it into the normal vector classes via a template argument, and in itself that works fine. However, I am not able to notice any speedup at all (tbh I only implemented the basic operations and tested them), and I suspect the implementation of both the core.simd.Vector types and the __simd magic causes a lot of copying around of data, which is kind of logical because the instructions run on the xmm/ymm registers.
Vector!(float, 4, true) a, b, c;
a = b = c = 1234.23234;
for (long i = 0; i < 1_000_000; i++) {
    a += b;
    a += c;
    a += a;
}
This should, in my understanding, run faster if a, b, c use core.simd.Vector!(float[4]) - however, it always ran slower than I expected.
It would be nice to work with data within these registers like you can with the Intel/C++ compiler intrinsics (the _mm_add_ss-like functions that take __m128 and __m256 types). So I'd go so far as to separate the normal vector/matrix types from the SIMD acceleration completely. Then you would do something like
vec4 a = vec4(123, 434, 124, 123);
vec4 b = vec4(434, 342, 323, 434);
simdVec!vec4 areg = a;
simdVec!vec4 breg = b;
for (int i = 0; i < 1_000_000; i++) {
    areg += breg; // ADDPS
    breg += areg; // ADDPS
}
float magnitude = areg.magnitude; // can be done with DPPS and SQRTSS
a = areg.toVec();
b = breg.toVec();
The main difference would be that the SIMD type (which ideally maps directly onto an xmm register) allows no direct access to the memory, to avoid any copying.
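An untested sketch of what such a wrapper could look like (the names SimdVec4 and toArray are made up for illustration): the data lives in a core.simd float4 and is only converted to a plain array at the boundaries.

```d
import core.simd;
import std.math : sqrt;

// Hypothetical wrapper: keeps the data in a float4 and never exposes
// element-wise memory access, so the compiler can keep it in a register.
struct SimdVec4
{
    private float4 reg;

    this(const float[4] v)
    {
        float4 tmp = [v[0], v[1], v[2], v[3]];
        reg = tmp;
    }

    // areg += breg; should compile down to a single ADDPS.
    void opOpAssign(string op : "+")(SimdVec4 rhs)
    {
        reg += rhs.reg;
    }

    // |v| via MULPS plus a horizontal sum and a scalar sqrt (SQRTSS);
    // DPPS could fuse the dot product on SSE4.1.
    float magnitude() const
    {
        float4 sq = reg * reg;
        const s = sq.array;
        return sqrt(s[0] + s[1] + s[2] + s[3]);
    }

    // Only crossing point back to ordinary memory.
    float[4] toArray() const
    {
        return reg.array;
    }
}
```

The design choice mirrors the `__m128` style: conversions to and from memory are explicit, so the cost of loads/stores shows up in the API instead of hiding inside every operator.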
Also, I'm kind of missing AVX support (the 256-bit ymm registers, double[4] stuff) in core.simd.__simd. Hmm.
Am I thinking right or am I blubbering complete bullshit? O.o
EDIT: I might have found the reason: this code:
import core.simd;

void doStuff()
{
    float4 x = [1.0, 0.4, 1234.0, 124.0];
    float4 y = [1.0, 0.4, 1234.0, 124.0];
    float4 z = [1.0, 0.4, 1234.0, 123.0];
    for (long i = 0; i < 1_000_000; i++) {
        x += y;
        x += z;
        z += x;
    }
}
Can be split in two parts. The first one is the assignment:
movaps xmm0,XMMWORD PTR [rip+0x0] # f <void example.doStuff()+0xf>
movaps XMMWORD PTR [rbp-0x40],xmm0
movaps xmm1,XMMWORD PTR [rip+0x0] # 1a <void example.doStuff()+0x1a>
movaps XMMWORD PTR [rbp-0x30],xmm1
movaps xmm2,XMMWORD PTR [rip+0x0] # 25 <void example.doStuff()+0x25>
movaps XMMWORD PTR [rbp-0x20],xmm2
Well, OK, it also copies the values onto the stack? Meh. Now the math in the loop:
movaps xmm3,XMMWORD PTR [rbp-0x30]
movaps xmm4,XMMWORD PTR [rbp-0x40]
addps xmm4,xmm3
movaps XMMWORD PTR [rbp-0x40],xmm4
movaps xmm0,XMMWORD PTR [rbp-0x20]
movaps xmm1,XMMWORD PTR [rbp-0x40]
addps xmm1,xmm0
movaps XMMWORD PTR [rbp-0x40],xmm1
movaps xmm2,XMMWORD PTR [rbp-0x40]
movaps xmm3,XMMWORD PTR [rbp-0x20]
addps xmm3,xmm2
movaps XMMWORD PTR [rbp-0x20],xmm3
OUCH! This should simply be
addps xmm0,xmm1
addps xmm0,xmm2
addps xmm2,xmm0
I guess I should report that as a compiler bug? https://issues.dlang.org/show_bug.cgi?id=16605
Thanks for looking into all of this.
I can't really help you here since my knowledge of SSE/SIMD instructions is very limited. You might want to ask in #D on freenode; there are some very smart people with compiler insight who can probably help you in a timely manner.
No problem, I enjoy this kind of stuff :)
I'm gonna head there, because I'm still not sure if my knowledge about SSE/SIMD is enough to come to the right conclusions. Let's see where this is headed!
It was me who was the fool! "-release" != "-O -release -boundscheck=off"
Now that looks like something!
Running ./gl3nspeed
Doing tests with SIMD=false and LC=10000000
Speed of the += operator on float (vec4+=vec4)
took: 0.140215s!
Doing tests with SIMD=true and LC=10000000
Speed of the += operator on float (vec4+=vec4)
took: 0.050686s!
That's almost 3 times faster!
> That's almost 3 times faster!
It gets better!
Enter loop count
10000000
Doing tests with SIMD=false and LC=10000000
Speed of the += operator on float (vec4+=vec4)
took: 0.139876s!
Speed of the magnitude operation on float |vec4|
took: 5.30766s!
Doing tests with SIMD=true and LC=10000000
Speed of the += operator on float (vec4+=vec4)
took: 0.0507099s!
Speed of the magnitude operation on float |vec4|
took: 1.02721s!
I'm gonna clean this up a bit and push it to my fork so you can take a look at it - if it fits the guidelines / the way you want stuff to be done for gl3n.
Here are my changes so far: master...mkalte666:master
I know that this is missing tests etc. I'll write those as soon as I can, I guess.
The speed test tool i used is https://github.com/mkalte666/gl3nspeed
You have to compile gl3n/gl3nspeed with DFLAGS="-release -O -boundscheck=off" dub. Or tell me how I can get dub to use -O xD
Looks good, minor style things but in general I like how it is done!
You gonna look into matrices as well?
> Looks good, minor style things but in general I like how it is done!

Thanks, I'm trying ^^
> You gonna look into matrices as well?
If I find the time. I'm not sure how well that can be done and what instructions already exist that could help out. Also, I still want to look into #68, and I guess that could be combined.
Thinking about speed (and not about time management on my side), this would be a massive improvement, however: "4x4 matrix multiplication is 64 multiplications and 48 additions. Using SSE this can be reduced to 16 multiplications and 12 additions (and 16 broadcasts)" http://stackoverflow.com/questions/18499971/efficient-4x4-matrix-multiplication-c-vs-assembly
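To illustrate the idea from that Stack Overflow answer, here's an untested sketch with core.simd (the mul4x4 name and the row-major float4[4] layout are assumptions, and the broadcast here goes through an array literal rather than SHUFPS):

```d
import core.simd;

// Sketch: c = a * b for row-major 4x4 matrices stored as four float4 rows.
// Result row i is the sum over k of broadcast(a[i][k]) * b[k], which is
// 16 broadcasts, 16 MULPS and 12 ADDPS instead of 64 scalar multiplies.
void mul4x4(ref float4[4] c, const float4[4] a, const float4[4] b)
{
    foreach (i; 0 .. 4)
    {
        float s = a[i].array[0];
        float4 acc = [s, s, s, s]; // broadcast a[i][0] into all lanes
        acc *= b[0];               // 1 MULPS
        foreach (k; 1 .. 4)
        {
            s = a[i].array[k];
            float4 t = [s, s, s, s];
            acc += t * b[k];       // MULPS + ADDPS
        }
        c[i] = acc;
    }
}
```

Per row that is 4 multiplies and 3 adds on whole vectors, matching the 16/12 count from the quote.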
One thing I wonder is whether operations with scalars (vec3 * float etc.) should be vectorized at all. While the operation itself would speed up, as long as the numerical value is not constant the resulting code would almost always be slower, because the scalar would have to be broadcast into a vector beforehand.
The fast way of doing such a multiplication (or any operation) would be to hold a (const?) vector somewhere and then do the operations. So doing
Vector!(float,4,true) scalar = 4.0;
Vector!(float,4,true) foo = 1234.01234;
Vector!(float,4,true) bar = 13.2434;
foo *= scalar;
bar *= scalar;
// ..... probably do this many times
would almost always result in faster code than if one would do
foo *= 4.0;
bar *= 4.0;
because the operator doesn't know whether it operates on a constant value or a variable. If there is a way to separate them (detecting whether a value is known at compile time), it could be done though - I don't know how.
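One possible way to separate the two cases, as an untested sketch (the scale name and overloads are made up): pass a compile-time-known scalar as a template value parameter, so the broadcast vector is a constant the compiler can fold, and keep a runtime overload for everything else.

```d
import core.simd;

// Scalar known at compile time: 's' is a template value parameter,
// so the splat vector is a constant and costs nothing per call.
float4 scale(float s)(float4 v)
{
    float4 splat = [s, s, s, s];
    return v * splat;
}

// Runtime scalar: the broadcast has to happen on every call.
float4 scale(float4 v, float s)
{
    float4 splat = [s, s, s, s];
    return v * splat;
}

// usage:
// auto a = scale!(4.0f)(v); // compile-time constant scalar
// auto b = scale(v, x);     // runtime scalar
```

The caller then chooses explicitly, which sidesteps the operator's inability to tell constants from variables.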