CommonSubexpressionElimination takes extremely long to run

Question

CommonSubexpressionElimination takes extremely long to run

justhecuke opened this issue 2 years ago · 4 comments

Hey,

Our team tried to enable CommonSubexpressionEliminationPass but gave up on it when it took ~50m to run by itself.

After looking at the code, the likely cause of the runtime appears to be the tight CSE -> if !chage break -> else Constant Propagation -> Local DCE loop. We haven't actually profiled the thing but, considering that CSE should take only a few seconds to run, it seems likely.

Can I ask why the tight loop was added? Does it provide some important benefit on top of CSE?

For reference, it only appears to reduce our package size by a small amount, but its odd construction has raised questions we don't know the answers to. I suspect the small size reduction may be due to Proguard already performing this optimization, but I don't have hard proof of this quite yet.

Answer 1 · 2023-02-28T21:34:33.000Z

The commit message on ffb4d1e might be helpful.

Answer 2 · 2023-02-28T22:05:21.000Z

Somewhat. The reasoning behind the loop is fair enough, I'm just unsure of the benefit and if the runtime is actually what they expect. Considering the number of times they appear to run the pass, I can't imagine that each run takes ~50m for them. Or, if it does, then they must be getting a lot more benefit out of it than I am.

Answer 3 · 2023-03-08T01:27:53.000Z

We have not found to be a problem in practice. But I could imagine some pathological code patterns where it does become a performance problem. Could you to try change the code of that loop, impose some fixed limit (maybe 10) for the iteration count? It should be semantically fine to break out of the loop early. Let us know if that solves your problem, and what you find as a reasonable limit, and then we can that as an option.

Answer 4 · 2023-03-09T20:06:58.000Z

I've modified the code to only perform a single iteration with no Constant Propagation or LocalDCE and to replace the WeakTopologicalOrdering with a simple sort of DexMethods in Purity.cpp. Our local build time went down from ~1600s (~27 minutes) to ~200s of which ~1000s was due to the WTO. Adding a second iteration makes the time go up to 250s. Adding Constant Propagation and LocalDCE ups the time to ~700s.

EDIT: Our CI time went down from ~7550s (!) to ~80s in the minimal case (vs. ~200s locally).
EDIT2: Seems like in the CI build, WTO takes ~7450s (init_method_barriers takes ~7500s) and the rest of the pass takes ~100s. Seems like our CI just really doesn't like WTO. Probably memory thrashing of some sort going on.

Strangely, it seems like replacing the WTO with a simple sort has caused a correctness issue and I'm not sure how that happened. Adding in runtime asserts also caused an issue since some of the Pure Methods, such as Long.valueOf, are not actually pure and will fail == checks between old and new locations. I'm going to see if I can't modify the assertion to Object.equals() if the type is a ref.