Usage of "auto ref" for Vector and Matrix operations
claudemr opened this issue · 7 comments
I am continuing the "auto ref" discussion we started in:
#169
I ran some tests with "auto ref" and, as a conclusion, I see no reason we should avoid it. As you (p0nce) said, template bloat is not an issue at all here, and it will never generate useless template instantiations.
I had a look at the assembler output of a piece of code passing a struct by value, by "const ref", and by "auto ref const".
Here's the code:
import std.stdio;

struct A
{
    int[4] data;
}

int proc_val(A a) nothrow pure @safe @nogc
{
    return a.data[0] + a.data[1] + a.data[2] + a.data[3];
}

int proc_cref(const ref A a) nothrow pure @safe @nogc
{
    return a.data[0] + a.data[1] + a.data[2] + a.data[3];
}

int proc_aref()(auto ref const(A) a) nothrow pure @safe @nogc
{
    return a.data[0] + a.data[1] + a.data[2] + a.data[3];
}

void main()
{
    auto a = A([0, 1, 2, 3]);
    writeln(proc_val(a));  // ok
    writeln(proc_cref(a)); // ok
    writeln(proc_aref(a)); // ok
    writeln(proc_val(A([3, 2, 4, -6])));   // ok
    //writeln(proc_cref(A([3, 2, 4, -6]))); // error: an rvalue cannot bind to ref
    writeln(proc_aref(A([3, 2, 4, -6])));  // ok: instantiates the by-value version
}
I compiled with LDC targeting the ARM backend (I am more comfortable with ARM assembly, but I expect the result to be similar with x86 output), with -O2 to get the tightest code.
ldc2 -v -O2 -c -mtriple=arm-none-linux-gnueabi -gcc=/opt/arm-2009q1/bin/arm-none-linux-gnueabi-gcc constref.d
/opt/arm-2009q1/bin/arm-none-linux-gnueabi-objdump -S constref.o > constref.S
Here is the generated code for each "proc" function (or instantiation):
_D8constref8proc_valFNaNbNiNfS8constref1AZi:
    add r0, r0, r1
    add r0, r0, r2
    add r0, r0, r3
    bx lr

_D8constref9proc_crefFNaNbNiNfKxS8constref1AZi:
    ldm r0, {r1, r2, r3}
    ldr r0, [r0, #12]
    add r1, r2, r1
    add r1, r1, r3
    add r0, r1, r0
    bx lr

_D8constref14__T9proc_arefZ9proc_arefFNaNbNiNfKxS8constref1AZi:
    ldm r0, {r1, r2, r3}
    ldr r0, [r0, #12]
    add r1, r2, r1
    add r1, r1, r3
    add r0, r1, r0
    bx lr

_D8constref14__T9proc_arefZ9proc_arefFNaNbNiNfxS8constref1AZi:
    add r0, r0, r1
    add r0, r0, r2
    add r0, r0, r3
    bx lr
So the two instantiations of proc_aref generate exactly the same code as proc_val and proc_cref, respectively.
A small comment: in the particular case of the ARM backend, the C ABI requires the first four 32-bit parameters to be passed in the first four registers, r0 to r3, with any further parameters pushed onto the stack. So in this particular case, the functions taking the parameter by value need fewer instructions and no memory dereferences (ldm and ldr), but we should not rely on that, since the caller has to load the registers anyway.
Well, unless someone has a serious argument against "auto ref", I would strongly suggest using it in the Matrix and Vector functions.
Thanks for the listings.
It is nice that the auto ref version seems as fast as the by-value one, but we have to measure it on a real case.
And, without thinking too much about it, I have a concern that more calls would become by-ref where they currently aren't. I doubt this will be slower, but one never knows.
To illustrate:
void call_auto_ref_function()(auto ref vec4f a)
{
    // do something
}

vec4f a;
call_auto_ref_function(a);
Here call_auto_ref_function is instantiated with a passed by ref, but if call_auto_ref_function were inlined, the copy to the stack would be avoided anyway, and the performance might not look the same.
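For reference (not from the original snippet), a minimal sketch of how the two instantiations can be told apart: inside an `auto ref` template, `__traits(isRef, a)` reports whether that particular instantiation binds the argument by reference. The helper name `boundByRef` is made up for illustration.

```d
import std.stdio;

// Sketch: __traits(isRef, a) is true in the by-ref (lvalue) instantiation
// and false in the by-value (rvalue) instantiation of an auto ref template.
bool boundByRef(T)(auto ref T a)
{
    return __traits(isRef, a);
}

void main()
{
    int x = 42;
    writeln(boundByRef(x));  // lvalue -> by-ref instantiation: prints "true"
    writeln(boundByRef(42)); // rvalue -> by-value instantiation: prints "false"
}
```

So an lvalue argument like `a` above picks the by-ref instantiation, which is exactly the case where inlining of the by-value version could have elided the copy anyway.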
We really have to measure in presence of inlining too.
Hence why I'm wary and need to run benchmarks on code that uses Vector heavily before reaching a conclusion.
Ah yes, that's true.
To do proper realistic benchmarks, maybe we should take a look at a proper 3D engine.
I am not a specialist at all in this domain, but I think matrix/vector/quaternion operations are mostly used when calculating the position/orientation of a hierarchy of objects in the 3D scene (lots of matrix multiplications, inverses, matrix/vector multiplications, quaternion-to-matrix conversions).
Or maybe mesh generation, with a lot of vector interpolation, normalization, and dot/cross products.
My embryo of 3D engine is not mature enough to give proper feedback.
Be sure I'll get back to this later, this could help with performance of rendering code I'm using.
Yes, it's not urgent. I have not worked out how to do a proper realistic benchmark yet.
But at least we have a solution if passing by value turns out to be a performance bottleneck.
FWIW I plan to use pbr-sketch (https://github.com/AuburnSounds/dplug/tree/master/tools/pbr-sketch) as a benchmark; it fits my needs. However, it uses a forked version of gfm:math, whose duplication is not very clean after all.
I tried with pbr-sketch.
So for the Vector case, auto ref const doesn't seem to make any difference in performance. Pretty sure I had the same result before.
Without auto ref const:
Width = 666 height = 412
Rendered 200 times in 24.818 sec
Time samples: [126, 121, 122, 126, 121, 121, 122, 121, 121, 125, 121, 134, 149, 150, 139, 120, 123, 122, 122, 121, 123, 121, 121, 121, 124, 122, 122, 127, 120, 121, 122, 121, 120, 124, 122, 121, 122, 123, 121, 122, 121, 122, 122, 126, 127, 129, 122, 122, 122, 123, 122, 121, 152, 148, 149, 129, 121, 125, 121, 121, 121, 120, 122, 120, 121, 124, 121, 120, 120, 121, 125, 121, 120, 124, 126, 121, 121, 122, 122, 121, 121, 121, 126, 121, 122, 121, 121, 125, 121, 121, 125, 121, 138, 147, 148, 139, 121, 122, 122, 121, 120, 121, 122, 125, 120, 121, 123, 121, 120, 121, 121, 122, 121, 120, 123, 121, 121, 120, 120, 122, 129, 120, 123, 123, 123, 122, 120, 122, 120, 122, 120, 124, 124, 159, 152, 140, 123, 121, 124, 120, 121, 121, 121, 120, 122, 120, 122, 123, 121, 119, 122, 120, 122, 124, 122, 123, 120, 121, 135, 122, 122, 120, 120, 125, 121, 122, 121, 121, 123, 123, 124, 122, 122, 145, 147, 147, 134, 120, 121, 123, 121, 120, 121, 121, 121, 122, 125, 121, 123, 121, 120, 121, 120, 123, 121, 121, 124, 121, 121, 121]
Min = 119 ms
Mean = 124.09 ms per render
Writing 371238 bytes
With auto ref const wherever possible:
width = 666 height = 412
Rendered 200 times in 24.815 sec
Time samples: [125, 129, 120, 126, 122, 122, 123, 120, 124, 121, 121, 124, 123, 121, 120, 120, 121, 121, 121, 122, 123, 121, 121, 122, 121, 122, 121, 120, 128, 151, 160, 142, 120, 121, 121, 121, 121, 120, 122, 126, 122, 120, 121, 122, 130, 133, 130, 128, 122, 120, 122, 120, 124, 122, 119, 125, 121, 121, 128, 122, 149, 121, 120, 122, 120, 121, 120, 121, 131, 149, 151, 146, 124, 125, 124, 122, 128, 122, 122, 121, 122, 120, 121, 120, 126, 121, 120, 121, 120, 122, 122, 120, 120, 126, 120, 120, 121, 119, 122, 121, 124, 122, 120, 120, 120, 120, 121, 123, 119, 142, 155, 148, 131, 120, 122, 121, 120, 121, 120, 122, 121, 120, 120, 123, 121, 122, 122, 120, 121, 120, 120, 120, 121, 121, 123, 120, 121, 120, 121, 120, 124, 122, 124, 120, 121, 121, 120, 120, 120, 122, 154, 149, 150, 122, 121, 120, 121, 122, 125, 120, 120, 120, 122, 120, 120, 121, 121, 123, 121, 120, 120, 124, 120, 121, 122, 127, 122, 120, 120, 122, 121, 120, 120, 123, 122, 121, 119, 121, 121, 121, 130, 152, 152, 143, 120, 122, 122, 124, 120, 124]
Min = 119 ms
Mean = 124.075 ms per render
Writing 371238 bytes
I'll try to modify the rendering to include matrix operations so that the benchmark is useful for benching matrix * vector ops.
Testing with a lot of 3x3 matrix * matrix and matrix * vector multiplies:
// Without auto ref:
width = 666 height = 412
Rendered 50 times in 30.172 sec
Time samples: [604, 602, 632, 710, 581, 585, 575, 574, 574, 573, 573, 589, 646, 576, 578, 804, 664, 667, 630, 654, 654, 604, 591, 592, 597, 581, 582, 578, 655, 577, 575, 575, 575, 576, 574, 579, 634, 614, 593, 583, 578, 583, 586, 578, 581, 657, 575, 599, 579, 576]
Min = 573 ms
Mean = 603.44 ms per render
Writing 63547 bytes
// With auto ref const
width = 666 height = 412
Rendered 50 times in 30.523 sec
Time samples: [608, 605, 595, 600, 596, 649, 635, 641, 603, 607, 606, 604, 596, 630, 664, 600, 599, 598, 599, 605, 595, 599, 684, 597, 599, 594, 601, 596, 600, 605, 685, 601, 592, 602, 599, 592, 604, 596, 685, 596, 598, 598, 594, 602, 600, 593, 644, 639, 596, 597]
Min = 592 ms
Mean = 610.46 ms per render
Writing 63547 bytes
auto ref doesn't seem to help speed at all; if anything, it's a bit slower.
Compiler used: ldc v1.0.0-b2, 64-bit, -b release-nobounds
dub build --skip-registry=all -b release-nobounds --compiler=path\to\ldc2.exe
pbr-sketch -n 50
My guess is that the compiler already does some pass-by-ref optimization?