d-gamedev-team/gfm

Usage of "auto ref" for Vector and Matrix operations

claudemr opened this issue · 7 comments

I continue the discussion about "auto ref" we started there:
#169

I ran some tests with "auto ref" and, in conclusion, I see no reason to avoid it. As you (p0nce) said, template bloat is not an issue at all, and it never generates useless template instantiations.

I looked at the assembly output of a piece of code that passes a struct by value, by "const ref" and by "auto ref const".

Here's the code:

import std.stdio;

struct A
{
    int[4] data;
}

int proc_val(A a) nothrow pure @safe @nogc
{
    return a.data[0] + a.data[1] + a.data[2] + a.data[3];
}

int proc_cref(const ref A a) nothrow pure @safe @nogc
{
    return a.data[0] + a.data[1] + a.data[2] + a.data[3];
}

int proc_aref()(auto ref const(A) a) nothrow pure @safe @nogc
{
    return a.data[0] + a.data[1] + a.data[2] + a.data[3];
}

void main()
{
    auto a = A([0, 1, 2, 3]);
    writeln(proc_val(a));        //ok
    writeln(proc_cref(a));       //ok
    writeln(proc_aref(a));       //ok
    
    writeln(proc_val(A([3, 2, 4, -6])));        //ok
    //writeln(proc_cref(A([3, 2, 4, -6])));       //nok: an rvalue cannot bind to a ref parameter
    writeln(proc_aref(A([3, 2, 4, -6])));       //ok
}

I compiled with LDC targeting the ARM backend (I am more comfortable with ARM assembly language, but I expect the result to be the same with x86 assembly output), using -O2 to get the tightest code.

ldc2 -v -O2 -c -mtriple=arm-none-linux-gnueabi -gcc=/opt/arm-2009q1/bin/arm-none-linux-gnueabi-gcc constref.d
/opt/arm-2009q1/bin/arm-none-linux-gnueabi-objdump -S constref.o > constref.S

And here is the code generated for each "proc" function (or instantiation):

_D8constref8proc_valFNaNbNiNfS8constref1AZi:
 	add	r0, r0, r1
 	add	r0, r0, r2
 	add	r0, r0, r3
 	bx	lr

_D8constref9proc_crefFNaNbNiNfKxS8constref1AZi:
 	ldm	r0, {r1, r2, r3}
 	ldr	r0, [r0, #12]
 	add	r1, r2, r1
 	add	r1, r1, r3
 	add	r0, r1, r0
 	bx	lr

_D8constref14__T9proc_arefZ9proc_arefFNaNbNiNfKxS8constref1AZi:
 	ldm	r0, {r1, r2, r3}
 	ldr	r0, [r0, #12]
 	add	r1, r2, r1
 	add	r1, r1, r3
 	add	r0, r1, r0
 	bx	lr

_D8constref14__T9proc_arefZ9proc_arefFNaNbNiNfxS8constref1AZi:
 	add	r0, r0, r1
 	add	r0, r0, r2
 	add	r0, r0, r3
 	bx	lr

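So the two instantiations of proc_aref generate exactly the same code as proc_cref (the by-ref instantiation, used for the lvalue call) and proc_val (the by-value instantiation, used for the rvalue call), respectively.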
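For anyone who wants to check which instantiation a given call selects without reading assembly, here is a minimal sketch using `__traits(isRef, ...)`. The helper name `passedByRef` is made up for this illustration and is not part of gfm.

```d
import std.stdio;

struct A
{
    int[4] data;
}

// Hypothetical helper: reports whether this particular instantiation
// received its argument by reference.
bool passedByRef()(auto ref const(A) a)
{
    return __traits(isRef, a);
}

void main()
{
    auto a = A([0, 1, 2, 3]);
    writeln(passedByRef(a));                // lvalue argument: by-ref instantiation
    writeln(passedByRef(A([3, 2, 4, -6]))); // rvalue argument: by-value instantiation
}
```

The first call should report true and the second false, which matches the two mangled names in the listing above (the one with `K` is the ref version).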

A small comment: in the particular case of the ARM backend, the C ABI requires that the first four 32-bit parameters be passed in the first four registers, r0 to r3, with any further parameters pushed onto the stack. So here the by-value functions take fewer instructions and need no memory loads (ldm and ldr), but we should not rely on that, since the caller has to load the registers anyway.

Well, unless someone has a serious argument against "auto ref", I would strongly suggest using it in the Matrix and Vector functions.

p0nce commented

Thanks for the listings.
It is nice that in the case of "value" the auto ref version seems as fast, but we have to measure it on a real case.
And, without thinking about it too much, I have a sudden concern that more calls would become "by-ref" where they currently aren't. I doubt this will be slower, but one never knows.

To illustrate:

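import gfm.math.vector; // provides vec4f

// auto ref is only allowed on template functions
void call_auto_ref_function(T)(auto ref T a)
{
    // do something
}

void main()
{
    vec4f a;
    call_auto_ref_function(a); // a is an lvalue, so it is passed by ref
}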

call_auto_ref_function is instantiated taking a by-ref, but if call_auto_ref_function gets inlined, the copy to the stack is avoided either way, so the performance may not look the same.
We really have to measure in the presence of inlining too.

Hence why I'm wary and need to perform benchmarks on code heavily using Vector to reach a conclusion.
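As a first rough check before benchmarking real rendering code, a micro-benchmark along these lines could be used. This is only a sketch: `Vec4`, `sumVal` and `sumAref` are stand-ins invented for the example (not gfm types), the iteration count is arbitrary, and an optimizing compiler may inline or simplify the loop bodies, so the numbers say little about a real workload.

```d
import std.datetime.stopwatch : benchmark;
import std.stdio;

// Stand-in for a small math vector; not the gfm Vector type.
struct Vec4
{
    float[4] v;
}

float sumVal(Vec4 a)
{
    return a.v[0] + a.v[1] + a.v[2] + a.v[3];
}

float sumAref()(auto ref const(Vec4) a)
{
    return a.v[0] + a.v[1] + a.v[2] + a.v[3];
}

void main()
{
    auto a = Vec4([1.0f, 2.0f, 3.0f, 4.0f]);
    double sink = 0; // accumulate results so the calls are not trivially dead

    void byValue()   { sink += sumVal(a); }
    void byAutoRef() { sink += sumAref(a); } // a is an lvalue: by-ref instantiation

    // Run each function the same (arbitrary) number of times.
    auto results = benchmark!(byValue, byAutoRef)(1_000_000);

    writefln("by value:    %s", results[0]);
    writefln("by auto ref: %s", results[1]);
    writefln("sink = %s", sink);
}
```

Building once with and once without optimizations and inlining would show how much of any difference survives inlining.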

claudemr commented

Ah yes, that's true.
To do proper, realistic benchmarks, maybe we should take a look at a proper 3D engine.

I am not a specialist in this domain at all, but I think matrix/vector/quaternion operations are mostly used when calculating the position and orientation of a hierarchy of objects in the 3D scene (a lot of matrix multiplication, matrix inversion, matrix/vector multiplication, quaternion-to-matrix conversion).

Or maybe mesh generation, with a lot of vector interpolation, normalization, and dot and cross (vector) products.

My embryonic 3D engine is not mature enough to give proper feedback.

p0nce commented

Be sure I'll get back to this later; this could help with the performance of rendering code I'm using.

claudemr commented

Yes, it's not urgent. I have not worked out how to do a proper, realistic benchmark yet.

But at least, we have a solution if it appears passing by value becomes a performance bottleneck.

p0nce commented

FWIW I plan to use pbr-sketch (https://github.com/AuburnSounds/dplug/tree/master/tools/pbr-sketch) as a benchmark; it fits my needs. However, it uses a forked version of gfm:math, and that duplication is not very clean after all.

p0nce commented

I tried with pbr-sketch.
So for the Vector case, auto ref const doesn't seem to make any difference in performance. Pretty sure I had the same result before.

Without auto ref const:

Width = 666  height = 412
Rendered 200 times in 24.818 sec
Time samples: [126, 121, 122, 126, 121, 121, 122, 121, 121, 125, 121, 134, 149, 150, 139, 120, 123, 122, 122, 121, 123, 121, 121, 121, 124, 122, 122, 127, 120, 121, 122, 121, 120, 124, 122, 121, 122, 123, 121, 122, 121, 122, 122, 126, 127, 129, 122, 122, 122, 123, 122, 121, 152, 148, 149, 129, 121, 125, 121, 121, 121, 120, 122, 120, 121, 124, 121, 120, 120, 121, 125, 121, 120, 124, 126, 121, 121, 122, 122, 121, 121, 121, 126, 121, 122, 121, 121, 125, 121, 121, 125, 121, 138, 147, 148, 139, 121, 122, 122, 121, 120, 121, 122, 125, 120, 121, 123, 121, 120, 121, 121, 122, 121, 120, 123, 121, 121, 120, 120, 122, 129, 120, 123, 123, 123, 122, 120, 122, 120, 122, 120, 124, 124, 159, 152, 140, 123, 121, 124, 120, 121, 121, 121, 120, 122, 120, 122, 123, 121, 119, 122, 120, 122, 124, 122, 123, 120, 121, 135, 122, 122, 120, 120, 125, 121, 122, 121, 121, 123, 123, 124, 122, 122, 145, 147, 147, 134, 120, 121, 123, 121, 120, 121, 121, 121, 122, 125, 121, 123, 121, 120, 121, 120, 123, 121, 121, 124, 121, 121, 121]
Min  = 119 ms
Mean = 124.09 ms per render
Writing 371238 bytes


With auto ref const wherever possible:

width = 666  height = 412
Rendered 200 times in 24.815 sec
Time samples: [125, 129, 120, 126, 122, 122, 123, 120, 124, 121, 121, 124, 123, 121, 120, 120, 121, 121, 121, 122, 123, 121, 121, 122, 121, 122, 121, 120, 128, 151, 160, 142, 120, 121, 121, 121, 121, 120, 122, 126, 122, 120, 121, 122, 130, 133, 130, 128, 122, 120, 122, 120, 124, 122, 119, 125, 121, 121, 128, 122, 149, 121, 120, 122, 120, 121, 120, 121, 131, 149, 151, 146, 124, 125, 124, 122, 128, 122, 122, 121, 122, 120, 121, 120, 126, 121, 120, 121, 120, 122, 122, 120, 120, 126, 120, 120, 121, 119, 122, 121, 124, 122, 120, 120, 120, 120, 121, 123, 119, 142, 155, 148, 131, 120, 122, 121, 120, 121, 120, 122, 121, 120, 120, 123, 121, 122, 122, 120, 121, 120, 120, 120, 121, 121, 123, 120, 121, 120, 121, 120, 124, 122, 124, 120, 121, 121, 120, 120, 120, 122, 154, 149, 150, 122, 121, 120, 121, 122, 125, 120, 120, 120, 122, 120, 120, 121, 121, 123, 121, 120, 120, 124, 120, 121, 122, 127, 122, 120, 120, 122, 121, 120, 120, 123, 122, 121, 119, 121, 121, 121, 130, 152, 152, 143, 120, 122, 122, 124, 120, 124]
Min  = 119 ms
Mean = 124.075 ms per render
Writing 371238 bytes

I'll try to modify the rendering to include matrix operations, so that the benchmark is also useful for benchmarking matrix * vector ops.

p0nce commented

Testing with a lot of 3x3 matrix * matrix multiplies and matrix * vector multiplies:

// Without auto ref:
width = 666  height = 412
Rendered 50 times in 30.172 sec
Time samples: [604, 602, 632, 710, 581, 585, 575, 574, 574, 573, 573, 589, 646, 576, 578, 804, 664, 667, 630, 654, 654, 604, 591, 592, 597, 581, 582, 578, 655, 577, 575, 575, 575, 576, 574, 579, 634, 614, 593, 583, 578, 583, 586, 578, 581, 657, 575, 599, 579, 576]
Min  = 573 ms
Mean = 603.44 ms per render
Writing 63547 bytes


// With auto ref const
width = 666  height = 412
Rendered 50 times in 30.523 sec
Time samples: [608, 605, 595, 600, 596, 649, 635, 641, 603, 607, 606, 604, 596, 630, 664, 600, 599, 598, 599, 605, 595, 599, 684, 597, 599, 594, 601, 596, 600, 605, 685, 601, 592, 602, 599, 592, 604, 596, 685, 596, 598, 598, 594, 602, 600, 593, 644, 639, 596, 597]
Min  = 592 ms
Mean = 610.46 ms per render
Writing 63547 bytes

auto ref doesn't seem to help speed at all; if anything it's a bit slower (mean 610.46 ms vs 603.44 ms per render, min 592 ms vs 573 ms).
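To put a number on "a bit slower", the reported means and minimums from the two matrix runs above can be compared as a relative change. A small sketch; the function name `relativeChangePercent` is just for illustration:

```d
import std.stdio;

/// Relative change of `candidate` against `baseline`, in percent.
double relativeChangePercent(double baseline, double candidate)
{
    return (candidate - baseline) / baseline * 100.0;
}

void main()
{
    // Figures taken from the two matrix benchmark runs above (ms per render).
    writefln("mean: %+.2f%%", relativeChangePercent(603.44, 610.46));
    writefln("min:  %+.2f%%", relativeChangePercent(573.0, 592.0));
}
```

That works out to roughly +1.2% on the mean and +3.3% on the minimum for the auto ref const build. Given the spread visible in the time samples, this is a small difference.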

Compiler used: ldc v1.0.0-b2, 64-bit, -b release-nobounds

dub build --skip-registry=all -b release-nobounds --compiler=path\to\ldc2.exe
pbr-sketch -n 50

My guess is that the compiler already does some pass-by-ref optimization?