adobe-flash/crossbridge

performance aspects of crossbridge code

ddyer0 opened this issue · 2 comments

I have an unexpected (to me) result, comparing native C code, the same code
compiled with Crossbridge, and the same code manually translated to ActionScript.

The timings, in seconds (lower is faster), are

2.043 native Windows code
6.707 the same C code compiled with Crossbridge
25.3 hand-written ActionScript code

Losing a factor of about 3.3 going from native C to SWC is not too bad, but I'm totally
shocked that the carefully written ActionScript, using exactly the same algorithm,
should be a factor of about 12 slower than native C. The code in question is crunching
numbers with doubles (in C) vs. with :Number in ActionScript.

Native C:
double m = min (nr, min (ng, nb));
double nm = nv-m;
double ns = nm / nv ;
double r1 = (nv - nr) / nm ;
double g1 = (nv - ng) / nm ;
double b1 = (nv - nb) / nm ;
double nh;

if (nv == nr)
{
    if (m == ng)
        nh = 5.0 + b1 ;
    else
        nh = 1.0 - g1 ;
}

else if (nv == ng)
{
    if (m == nb)
        nh = 1.0 + r1 ;
    else
        nh = 3.0 - b1 ;
}

else if (nv == nb)
{
    if (m == nr)
        nh = 3.0 + g1 ;
    else
        nh = 5.0 - r1 ;
}

Actionscript:

var m:Number = (nr<ng) ? ((nr<nb) ? nr : nb) : ((ng<nb) ? ng : nb);
var mm:Number = (nv - m);
var ns:Number = mm / nv ;
var r1:Number = (nv - nr) / mm ;   
var g1:Number = (nv - ng) / mm ;
var b1:Number = (nv - nb) / mm ;
var nh:Number =
    (nv==nb)
        ? ((m == nr) ? (3.0 + g1) : (5.0 - r1)) 
        : ((nv == ng)
            ? ((m == nb) ? (1.0 + r1) : (3.0 - b1))
            : ((m == ng) ? (5.0 + b1) : (1.0 - g1)));

This may seem strange, but try using if conditionals instead of the ternary ops in the AS3, and see if that gives you better results. I can't say that I've examined the .abc output generated by nested ternary operations, but in my experience, AS3 handles execution branching in an inconsistent way. I agree that the number is surprising, but I have also seen various "swings" when timing AS3 algorithms. There are a few variables at play, including Flash Player version, Debugger vs. Release player, in-browser vs. stand-alone, etc.

Also, are you running that code on repeat in a tight loop and taking averages? Are you running in a function?
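For reference, a minimal C sketch of what "in a function, in a tight loop, taking averages" could look like. The names `rgb_to_hue_sat` and `mean_seconds` and the input values are made up for illustration; they are not from the original test. It assumes `nv` is the largest of the three components and that the colour is not grey (so `nv - m` is non-zero), which is why the last branch is a plain `else`.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical wrapper around the hue/saturation snippet from the issue.
   Assumes nv == max(nr, ng, nb) and nv > min(nr, ng, nb). */
void rgb_to_hue_sat(double nr, double ng, double nb, double nv,
                    double *nh, double *ns)
{
    double lo = nr < ng ? nr : ng;
    double m  = lo < nb ? lo : nb;          /* smallest component */
    double nm = nv - m;
    double r1 = (nv - nr) / nm;
    double g1 = (nv - ng) / nm;
    double b1 = (nv - nb) / nm;

    *ns = nm / nv;
    if (nv == nr)
        *nh = (m == ng) ? 5.0 + b1 : 1.0 - g1;
    else if (nv == ng)
        *nh = (m == nb) ? 1.0 + r1 : 3.0 - b1;
    else                                    /* nv == nb under the assumption above */
        *nh = (m == nr) ? 3.0 + g1 : 5.0 - r1;
}

/* Mean CPU seconds per run, over `runs` repetitions of `iterations` calls. */
double mean_seconds(int runs, long iterations, double *checksum)
{
    assert(runs > 0 && iterations > 0);
    double total = 0.0, sum = 0.0;
    for (int r = 0; r < runs; r++) {
        clock_t start = clock();
        for (long i = 0; i < iterations; i++) {
            double nh, ns;
            /* vary the input so the call cannot be hoisted out of the loop */
            double nr = 1.0 + (double)(i & 255);
            rgb_to_hue_sat(nr, 0.5, 0.25, nr, &nh, &ns);
            sum += nh + ns;
        }
        total += (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    *checksum = sum;   /* consuming the results keeps the loop from being optimized away */
    return total / runs;
}
```

Calling something like `mean_seconds(5, 1L << 24, &checksum)` and printing both the mean and the checksum gives one number per build to compare. Note that `clock()` reports CPU time, not wall-clock time, so this is only a rough yardstick.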

I'd guess that code should execute in about double the time of Crossbridge. The big reason is that Crossbridge compiles to code that performs these operations on the domain memory byte array, whereas the AS3 code is going to create managed objects. If you think about it in those terms, simple load/compare/store operations on a byte array are going to be a great deal faster.

I won't lie, I'm surprised that the Crossbridge code runs that much slower than the native Windows test, but a good rule of thumb for Crossbridge performance is to compare it to .NET execution. Crossbridge should be as fast as (if not slightly faster than) .NET in some calculations.

I'm running a real test with 2^24 x 2 distinct function calls in the loop. From other experiments, I've determined that a lot of the difference is function call overhead.

"Everything" in C is 2 seconds.

"Everything" cross-compiled is 6 seconds. Of that 6 seconds, about half is function
call overhead for internal calls, presumably not using the same stack frames as normal
AS3; the function arguments are doubles and pointers to double, which would not
be possible in normal AS3. So there's approximately a 3x speed penalty for cross-compiled
code (of this type).
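One rough way to separate call overhead from actual work is to time the same loop twice: once calling a near-empty stub and once calling the real kernel, both with the same signature (doubles plus a pointer to double). The sketch below is an assumption about how that could be done, not the method used for the numbers above; `stub_kernel`, `work_kernel`, and `time_calls` are hypothetical names, and the workload is a stand-in.

```c
#include <assert.h>
#include <time.h>

typedef void (*kernel_fn)(double, double, double *);

/* Same signature as the real work, but does almost nothing. */
void stub_kernel(double a, double b, double *out)
{
    (void)b;
    *out = a;
}

/* Stand-in for the real number crunching (hypothetical workload). */
void work_kernel(double a, double b, double *out)
{
    *out = (a - b) / (a + b + 1.0);
}

/* CPU seconds spent making `iterations` calls to `fn`. */
double time_calls(kernel_fn fn, long iterations, double *checksum)
{
    assert(fn != 0 && iterations > 0);
    kernel_fn volatile call = fn;   /* volatile pointer: forces a real call each iteration */
    double sum = 0.0, out = 0.0;
    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        call((double)(i & 1023), 2.0, &out);
        sum += out;
    }
    *checksum = sum;                /* keep the results live */
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

The time for `stub_kernel` approximates loop plus call overhead, and the difference between the two timings approximates the work itself. The calls here go through a function pointer, which may cost slightly more than a direct call, so treat the split as a ballpark only.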

The AS3 test loop overhead, including the function calls, is about 13 seconds.
The actual number crunching takes about 10 seconds in native AS3, compared to
about 3 seconds in cross-compiled code, so there's a further 3x penalty for using proper
AS3 data and stack structures.

Bottom line, in ballpark numbers: cross-compiling has a 3x speed penalty, but is still 3x faster
than writing pure AS3.
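As a sanity check, the ballpark factors can be recomputed from the three timings given at the top of the thread (2.043, 6.707 and 25.3 seconds). The helper name `slowdown` is made up for this sketch.

```c
#include <assert.h>

/* Timings from the first post, in seconds. */
static const double T_NATIVE_C    = 2.043;
static const double T_CROSSBRIDGE = 6.707;
static const double T_PURE_AS3    = 25.3;

/* How many times slower `slow` is than `fast`; both are times in seconds. */
double slowdown(double fast, double slow)
{
    assert(fast > 0.0);
    return slow / fast;
}
```

With these inputs, `slowdown(T_NATIVE_C, T_CROSSBRIDGE)` is roughly 3.3, `slowdown(T_CROSSBRIDGE, T_PURE_AS3)` is roughly 3.8, and `slowdown(T_NATIVE_C, T_PURE_AS3)` is roughly 12.4, which matches the "3x, then another 3x" summary.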

The part of this that is surprising to me is that the cross compiled code
is so much better than AS3 - I would have expected it to be much closer in speed.

Also note that this is not necessarily representative of what can be achieved
in pure byte-pushing code. My test is heavy on floating point and function calls.