emscripten-core/emscripten

Emscripten produces degenerate if-else chain for switch inside a loop

rfk opened this issue · 19 comments

rfk commented

I've found a certain pattern of spaghetti-goto code that seems to confound the relooper, causing it to produce a degenerate if-else chain when handling a switch inside a loop.

Full test case: https://gist.github.com/rfk/7889732

The generated code uses a switch to set a flag variable, then uses a big if-else chain to decide what to do based on this flag variable, like so:

   LOOP: while (1) {
     switch (input[i]) {
       case '0':
         flag = 1;
         break LOOP;
         break;
       case '1':
         flag = 2;
         break LOOP;
         break;
       ...etc...
     }
     i++;
   }
   if (flag == 1) {
     ...code for case 1...
   } else if (flag == 2) {
     ...code for case 2...
   }

If there are many cases in this switch, the result can be some extremely inefficient code. I discovered the issue in the code generated by PyPy for its bytecode dispatch loop, which has more than 200 cases in a switch-in-a-loop like this.

It seems "obvious" that the generated code could be improved by pushing these blocks inside the switch statement, like so:

   LOOP: while (1) {
     switch (input[i]) {
       case '0':
         ...inline code for case 1...
         break LOOP;
         break;
       case '1':
         ...inline code for case 2...
         break LOOP;
         break;
       ...etc...
     }
     i++;
   }

So I'm hopeful that, while it's probably quite an obscure corner case, it can be improved without too much hassle. I'd be happy to try my hand at adding this optimisation - can you suggest a good place to start poking?

The relooper by default only leaves code in loops that is necessarily part of the loop (i.e. that can return to the loop start). But look in src/relooper/Relooper.cpp at lines 545 and 569: each of those is a chunk of commented-out code that hoists more stuff, which I think is what you want. A good starting point might be to try uncommenting the second one and seeing how that affects things here.

Note that you need to clear the cache to get the relooper rebuilt during development (emcc --clear-cache). Also perhaps useful, there are some .sh files and a fuzzer in the relooper directory to help with testing it standalone.

rfk commented

Great! The "// We can avoid multiple next entries by hoisting them into the loop." code from line 569 seems to do the trick for this testcase, I will try it out on the full interpreter loop and report back.

Are there known cases where this de-optimises the code, or would it be safe to enable by default?

I think both regressed the test suite in places, but there might be some heuristics we can tweak to avoid that. We would need a benchmark for the benchmark suite to justify this, btw.

rfk commented

Result: the non-JIT PyPy interpreter went from 5K pystones/sec to 9K with this change. So it definitely makes a big difference for my use-case.

We would need a benchmark for the benchmark suite to justify this, btw.

Yep. I'll keep experimenting.

Enabling the hoisting code makes DOSBox a lot faster. The big switch statement for decoding x86 instructions in CPU_Core_Normal_Run() changes from having 527 label tests to having 161 label tests. It seems all of those un-hoisted cases set label to one of two non-zero values, which are then dealt with in the next block with two label tests for those values. I expect that those 161 label tests can also be hoisted with improved code.

I'm not seeing any problems. Does anyone know what tests in particular fail with this code enabled?

I have not run the test suite on that for a long time, so I am not sure of the current status. A good starting point would be to run the benchmark suite with and without the patch to check for speed regressions: python tests/runner.py benchmark (see tests/test_benchmark.py for more details; you may want to edit that to pick which JS engine is used, etc.).

We should certainly enable this if it provides benefits.

I was just looking at how this code doesn't eliminate some labels. This pattern causes every case containing a continue to not be hoisted, via the "abort this hoisting" code.

volatile int x;

int main(int argc, char **argv) {
    while (argc-- > 0) {
restart_opcode:
        switch (x) {
        case 1: x += 1; break;                /* plain break: fall through to the code after the switch */
        case 2: x += 2; break;
        case 3: x += 3; continue;             /* continue: back to the top of the while loop */
        case 4: x += 4; continue;
        case 5: x += 5; goto restart_opcode;  /* goto: re-enter the switch, skipping the loop test */
        case 6: x += 6; goto restart_opcode;
        }
        x += 100;
    }
    return 0;
}
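
For reference, the cases that do not get hoisted end up routed through a label variable, which is then tested after the switch. Roughly, the shape at the C level looks like the following (a hypothetical hand-written sketch of the pattern; the real output is the JavaScript emitted by the relooper):

volatile int x;

int main(int argc, char **argv) {
    while (argc-- > 0) {
restart_opcode:
        {
            int label = 0;
            switch (x) {
            case 1: x += 1; break;
            case 2: x += 2; break;
            case 3: x += 3; label = 1; break;   /* was: continue */
            case 4: x += 4; label = 1; break;   /* was: continue */
            case 5: x += 5; label = 2; break;   /* was: goto restart_opcode */
            case 6: x += 6; label = 2; break;   /* was: goto restart_opcode */
            }
            if (label == 1) {
                continue;
            } else if (label == 2) {
                goto restart_opcode;
            }
        }
        x += 100;
    }
    return 0;
}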

Calling NextEntries.insert(Target); instead of setting abort gets rid of this problem. Even in the huge DOSBox CPU_Core_Normal_Run() function, I see no inappropriate use of labels, and the .js file gets a bit smaller. DOSBox runs even faster. However, I can't claim I understand the relooper sufficiently to know the full implications of that change, and I haven't run any tests yet.

Update: This change isn't correct. When running the fuzzer I saw it move a part of a loop from below into a loop above, including the labelled continue, causing an error. Some condition is needed for deciding between setting abort and NextEntries.insert(Target).

Unfortunately, even if Emscripten produces a perfect switch statement, performance can be terrible in Chrome and anything else using the V8 JavaScript engine. If a switch contains too many cases, the function won't be optimized:
http://code.google.com/p/v8/issues/detail?id=2275
Search for kCaseClauseLimit in src/hydrogen.cc. Currently, the limit is 128.
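
As a purely illustrative aside (this is a hypothetical source-level workaround, not something Emscripten does), one way to keep every individual switch under such a case limit is to split dispatch into two levels, e.g. an outer switch over groups of opcodes and small inner switches within each group:

/* Hypothetical sketch: split a 256-case opcode dispatch so that no single
   switch exceeds a ~128-case optimizer limit. All names are illustrative. */
static void dispatch(unsigned op) {
    switch (op >> 4) {            /* outer: which group of 16 opcodes */
    case 0x0:
        switch (op & 0xF) {       /* inner: opcode within the group */
        case 0x0: /* handle opcode 0x00 */ break;
        case 0x1: /* handle opcode 0x01 */ break;
        /* ... */
        }
        break;
    case 0x1:
        /* ... inner switch for opcodes 0x10 through 0x1F ... */
        break;
    /* ... remaining groups ... */
    }
}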

Is that code sample representative of DOSBox? It would be good to have a good testcase that is also a benchmark, that is, one that runs a fixed workload so we can measure improvements by seeing how long it takes to run.

The code above represents a subset of the control flow possibilities in CPU_Core_Normal_Run() from src/cpu/core_normal.cpp in DOSBox 0.74. (CPU_Core_Simple_Run() is similar.) It is a loop executing x86 instructions one by one. Break, continue and goto restart_opcode seem to be the minimal set needed to not get full hoisting with the hoisting code enabled. DOSBox also has return statements and goto illegal_opcode, which is a label at the start of the default case.

All the actual work the code did has been replaced with single operations on volatile variable x. This simplifies things to the maximum while still preventing the optimizer from eliminating any key parts of the structure, and making it easy to see how the C++ was translated into JavaScript.

I could easily make this simple example into a benchmark exercising all paths equally, but I couldn't claim that it represents the DOSBox workload, or even the control flow part of the workload. Another possibility would be gathering statistical information on which control flow paths are taken when DOSBox is running some x86 code, and exercising the paths in proportion to those statistics. Actually benchmarking all of DOSBox x86 emulation is probably undesirable because the code is huge and GPL licensed.

BTW, in my DOSBox port I wrote a script which automatically transforms cases into functions and uses an array of function pointers instead of a switch statement. It is a lot faster in Chrome. Surprisingly, it is also faster in Firefox, even compared to a fully hoisted switch statement created using this flawed patch. So, switch statement improvements probably won't greatly improve DOSBox performance unless it is built without that transformation.

Yes, it can be hard to reduce a big app into a benchmark. But the goal is a small benchmark that, if we optimize it, we have a good reason to believe the full app will be optimized to some degree as well. If you can make a benchmark that gets somewhat close to that it would be very helpful here, even if we have no perfect guarantees.

Interesting about the transformation. For Chrome, I assume the reason is that they do not enter Crankshaft, as in that V8 issue. For Firefox, it is less obvious, but my suspicion is that very large functions end up with poorer register allocation behavior or something like that, so larger size correlates with slowness. On what level does your script work (C source, LLVM IR, generated JS)? I wonder if we should consider adding an optimization pass for that, if it makes a big difference.

Okay, I will try to make a simple benchmark which approximates the control-flow workload of the DOSBox CPU emulator.

Regarding the unexpected speedup in Firefox, I think it might be due to the huge number of variables declared and initialized to zero at the beginning of the function. I just counted them by compiling with -g and using grep -A 1 '^function.*Normal_Run' dosbox.js | tr , '\n' | grep -c =, and the answer is 3149. The vast majority are only used in particular cases, for executing particular instructions, and declared within braces within the case or possibly created during compilation. The transformation reduces that number to 11. Of course the functions created from cases have their own variables, but there won't be a huge number at the same time. (Update: Sorry, it's not quite that bad. A normal -O2 build without -g only has 204 variables in the function.)

JavaScript doesn't have block scope, and asm.js seems to require variable declarations to be at the start of a function. So, the only way to move them elsewhere is to split the function. The number could be reduced without splitting the function by using the same variable for multiple purposes. It would be kind of like register allocation.

My transformation uses preprocessed C source because that seems to be the easiest place in this particular instance. The functions return a number which determines what happens after the function returns, similarly to how Emscripten uses label. The only non-trivial part was determining which break statements break from the switch being replaced, and which are internal to the function.
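
A minimal sketch of that kind of transformation, applied to the earlier toy example rather than to DOSBox (all names here are hypothetical, and this is only an approximation of the approach described above):

volatile int x;

/* The return value tells the dispatch loop what to do next, playing much the
   same role as the relooper's label variable. */
enum { DO_BREAK, DO_CONTINUE, DO_RESTART };

static int op1(void) { x += 1; return DO_BREAK; }
static int op2(void) { x += 2; return DO_BREAK; }
static int op3(void) { x += 3; return DO_CONTINUE; }
static int op4(void) { x += 4; return DO_CONTINUE; }
static int op5(void) { x += 5; return DO_RESTART; }
static int op6(void) { x += 6; return DO_RESTART; }

static int (*const ops[])(void) = { 0, op1, op2, op3, op4, op5, op6 };

int main(int argc, char **argv) {
    (void)argv;
    while (argc-- > 0) {
restart_opcode:
        {
            int opcode = x;
            if (opcode >= 1 && opcode <= 6) {
                switch (ops[opcode]()) {
                case DO_CONTINUE: continue;
                case DO_RESTART:  goto restart_opcode;
                default:          break;  /* DO_BREAK: fall through below */
                }
            }
        }
        x += 100;  /* the code that followed the original switch */
    }
    return 0;
}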

Can you try -O3 on latest incoming as well? It has a new optimization that should reduce variables even further, dramatically in some cases.

Using f5c957b, the hoisting patch, and -O3, the number of variables has been reduced to 33. Nice!

The un-transformed CPU core using a huge switch statement is still a bit slower than the transformed core using separate functions, but the difference is small. In the first level of Duke Nukem 1 I was getting 25% to 26% CPU usage untransformed with the hoisting patch vs. 22% to 24% transformed.

Has -O3 become safer? I noticed I'm not getting warnings about it applying unsafe optimizations. Both DOSBox and JSMESS seem to work fine with -O3.

Without the hoisting patch, the number of variables is 32, and CPU usage is 50% to 55%.

rfk commented

Has -O3 become safer?

Yes, it now enables some safe-but-quite-expensive optimizations while the previous unsafe optimizations are available only via explicit flags.

Surprisingly, enabling hoisting by only changing the "#if 0" to "#if 1" makes DOSBox slower! CPU usage in the same game is around 63%. I repeated this test several times, and the results are consistent. Hoisting does decrease total "else if" occurrences in the function from 567 to 187. So, hoisting only seems to benefit DOSBox if abort = true is also replaced by NextEntries.insert(Target), which causes errors when linking some other code.

In the profiler, with asm.js disabled the CPU function uses 96.7% of total time in the unaltered version, and 97.1% with #if 1. I cannot profile individual functions in Firefox with asm.js enabled, but it seems highly probable that the increase in CPU usage is from the CPU function.

Interesting. Not sure how to explain that. But it sounds like hoisting is not necessary then?

Has anyone encountered this problem with recent versions, using fastcomp? I was just looking through CPU_Core_Normal_Run() in DOSBox without my source-level transformation, compiled with Emscripten 1.29.6 at -O3. The JS code looks good. There is a switch setting a variable and another testing the variable it set, but everything is done via switches, so it should be fast. It performs well in Firefox 35.

Performance is horrible in Chrome 40, but that is a V8 JS engine issue with big switches. It may make sense to have some workaround for that V8 issue in Emscripten, but at least this if-else chain problem might be fixed now.

Yeah, hopefully this should now be ok. The relooper emits switches when LLVM suggests to.

Regarding Chrome performance, they are now using TurboFan, a new backend, instead of Crankshaft for asm.js content, which is promising (not on release yet, but on dev). Hopefully TurboFan won't have the same limitations on switch size. However, when I tried the massive-sqlite benchmark - which tests, among other things, a very large switch - it didn't run correctly, so it might be too early to test. But hopefully soon. (bug: https://code.google.com/p/v8/issues/detail?id=3800)

stale commented

This issue has been automatically marked as stale because there has been no activity in the past 2 years. It will be closed automatically if no further activity occurs in the next 7 days. Feel free to re-open at any time if this issue is still relevant.