utwente-fmt/vercors

Performance issues

Closed this issue · 13 comments

@pieter-bos already noticed that the C files generated by Halide can sometimes be quite slow to process. I think I've found the worst offender yet.

This file took 2030s (about 33 minutes) to go through all passes, before being sent to Silicon (latest dev, commit 003a9ed).

SubDirectionHalide-NCB-L-P.c.txt

I do not fully understand why it is slow, but I've uploaded the dump I got from IntelliJ (see how to open it here: https://www.jetbrains.com/help/idea/create-a-profiling-report.html#open).

https://drive.google.com/file/d/1Vaj8QOgd3CAqDsUS7aedCC25zpAMUbDU/view?usp=sharing

My first insight is that it may be related to 'freeze' in vct/col/util/Scopes.scala:
[profiler screenshot]

I actually noticed freeze being a bottleneck before, but it disappeared completely when I wasn't profiling. And getting rid of the ThreadLocal in the SuccessionMap made things a lot faster while profiling, but made no difference when not profiling.

Ah, that makes sense; I was already wondering why it was 33 minutes.

So without profiling it takes 1m24,166s.

Still slow, although not as horrible as 33 minutes.

With ThreadLocal 'disabled', I get this timing:

5m41,707s

But maybe it's just unstable, sometimes faster and sometimes slower.

It mostly seems slow in the 'adt*' transformations: 'adtFrac', 'adtNothing', etc.

So I'm pretty sure the problem is somewhere in there.

With these passes disabled in Transformation.scala:
// ImportOption.withArg(adtImporter),
// ImportFrac.withArg(adtImporter),
// ImportNothing.withArg(adtImporter),
// ImportVoid.withArg(adtImporter),
// ImportNull.withArg(adtImporter),
// ImportAny.withArg(adtImporter),

it finishes all the transformations in 30s.

Hmm, new observation: when removing ThreadLocal from both vct/col/util/SuccessionMap.scala and hre/util/ScopedStack.scala, I get to a respectable 0m46,762s.

And now all passes are enabled.

> With ThreadLocal 'disabled', I get this timing: 5m41,707s
>
> It mostly seems slow in the 'adt*' transformations: 'adtFrac', 'adtNothing', etc.

In that 5m41 run I only had ThreadLocal disabled in vct/col/util/SuccessionMap.scala.

With ThreadLocal enabled I get a timing of 00:06:37, but now it got stuck in other passes (evaluationTarget, for instance).

So my conclusion for now is that with ThreadLocal the timing is hard to predict: sometimes it is a lot slower, and sometimes not.

Without ThreadLocal, the speed is consistent and seems faster.

@pieter-bos are the ThreadLocals necessary? I don't think I fully understand what they are doing and why they would be needed.

Edit: Very inconsistent. With ThreadLocal enabled, my last run of the same file took over 10m57 (I interrupted execution after that).
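The run-to-run variance may partly be JIT warmup; here is a minimal, generic sketch (not VerCors-specific, all names hypothetical) of how one could time a stage more stably by warming up first and taking a median:

```scala
object Timing {
  // Run the thunk `warmup` times so the JIT can settle, then report the
  // median of `reps` timed runs in milliseconds.
  def medianMillis[A](warmup: Int, reps: Int)(body: => A): Double = {
    var i = 0
    while (i < warmup) { body; i += 1 }
    val times = (0 until reps).map { _ =>
      val t0 = System.nanoTime()
      body
      (System.nanoTime() - t0) / 1e6
    }.sorted
    times(times.length / 2)
  }
}
```

Usage would look like `Timing.medianMillis(3, 5) { runTransformations(file) }`, where `runTransformations` is a placeholder for whatever stage is being measured.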

Transformations are set up in such a way that they can be parallelized, so ScopedStack and SuccessionMap have a well-defined notion of what happens when they are used in a parallel context. I am wildly confused as to why ThreadLocal is so slow; the implementation reads to me like a tiny, nearly constant-time map, so I don't understand a 10x speedup. The upside here is that you can essentially make VerCors parallel by rewriting all the global declarations in parallel, and it pretty much just works. The only reason that is not in mainline VerCors is that I'm not certain we should want to support it.
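For illustration, a toy sketch of that idea; `PerThreadMap` is a hypothetical stand-in, not the actual SuccessionMap implementation:

```scala
import scala.collection.mutable

// Hypothetical sketch: each thread gets its own mutable map via
// ThreadLocal, so parallel rewriters never observe each other's
// in-progress successor entries.
class PerThreadMap[K, V] {
  private val local = new ThreadLocal[mutable.Map[K, V]] {
    override def initialValue(): mutable.Map[K, V] = mutable.Map.empty
  }
  def update(k: K, v: V): Unit = local.get()(k) = v
  def get(k: K): Option[V] = local.get().get(k)
}

object ThreadLocalDemo {
  def main(args: Array[String]): Unit = {
    val m = new PerThreadMap[String, Int]
    m("decl") = 1
    val t = new Thread(() => {
      // This thread starts with an empty map: the main thread's entry
      // is invisible here, and this write stays local to this thread.
      m("decl") = 2
    })
    t.start(); t.join()
    println(m.get("decl")) // Some(1)
  }
}
```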

Disabling the imports helping so much is unfortunately expected: there is a known performance issue with all CoercingRewriters. The root of the issue is that match statements with many case arms are still compiled to a linear search, instead of some kind of lookup table by type. In particular, we spend nearly all of our time checking the type of Expr in CoercingRewriter.coerce(Expr). I suspect this would be resolved if we e.g. switched to Scala 3 and translated Node to an enum - I would appreciate it if someone investigated this.
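A toy illustration of the linear-search problem (hypothetical node types, not the real Node/Expr hierarchy): a large match over subtypes compiles to a chain of instanceof checks tried in order, whereas a table keyed by runtime class dispatches in a single lookup:

```scala
// Hypothetical toy AST; the real Expr hierarchy has hundreds of cases.
sealed trait Expr
case class IntLit(v: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr
case class Mul(l: Expr, r: Expr) extends Expr

object Dispatch {
  // Shaped like CoercingRewriter.coerce(Expr): each case arm is tried in
  // order, so with hundreds of arms the average cost grows linearly.
  def nameByMatch(e: Expr): String = e match {
    case _: IntLit => "IntLit"
    case _: Add    => "Add"
    case _: Mul    => "Mul"
  }

  // Alternative: dispatch through a table keyed by runtime class - one
  // hash lookup regardless of how many node types exist.
  private val table: Map[Class[_], String] = Map(
    classOf[IntLit] -> "IntLit",
    classOf[Add]    -> "Add",
    classOf[Mul]    -> "Mul"
  )
  def nameByTable(e: Expr): String = table(e.getClass)
}
```

Both return the same answer; the table variant trades the compiler-checked exhaustiveness of `match` for constant-time dispatch.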

I'll also have a look at the profile and see if there are any hints left :)

I think the profiler just does something which makes the ThreadLocal way more expensive, because indeed it shouldn't be that slow.

To be clear: even without the profiler, ThreadLocal seems to make my runs a lot slower and more unpredictable.

This is my latest profiler information, with the ThreadLocal disabled.
Blowup_2024_06_25_113758.zip

$ git diff
diff --git a/src/col/vct/col/ast/type/typeclass/TypeImpl.scala b/src/col/vct/col/ast/type/typeclass/TypeImpl.scala
index b2bb1bc8a..be98cef9c 100644
--- a/src/col/vct/col/ast/type/typeclass/TypeImpl.scala
+++ b/src/col/vct/col/ast/type/typeclass/TypeImpl.scala
@@ -54,7 +54,7 @@ trait TypeImpl[G] extends TypeFamilyOps[G] {
   def particularize(substitutions: Map[Variable[G], Type[G]]): Type[G] = {
     case object Particularize extends NonLatchingRewriter[G, G] {
       case object IdentitySuccessorsProvider
-          extends SuccessorsProviderTrafo[G, G](allScopes.freeze) {
+          extends SuccessorsProviderTrafo[G, G](null) {
         override def preTransform[I <: Declaration[G], O <: Declaration[G]](
             pre: I
         ): Option[O] = Some(pre.asInstanceOf[O])

Could you try this? Most of the freeze calls seem to come from type-checking ADT invocations that need to monomorphize over the supplied type arguments, but that doesn't actually need the successors.

Otherwise it seems the design is a bit flawed. Every AbstractRewriter has a Scopes to account for scopes, and this is abstracted to a SuccessorsProvider to do custom things - but by default succProvider already calls freeze on allScopes, which considers all scopes for every reference. A SuccessorsProvider should instead not yet have frozen the situation, and only do so after a declaration kind is chosen - this should save us something like 95% of the remaining calls.
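A sketch of that lazy variant (hypothetical interfaces, not the actual Scopes/SuccessorsProvider API; `freeze()` stands in for the expensive snapshot operation): freeze only the scope of the declaration kind that is actually queried, and cache the result per kind:

```scala
import scala.collection.mutable

// Hypothetical stand-in for one declaration-kind scope; freeze() is
// assumed to be the expensive snapshot of predecessor -> successor.
class Scope[D](entries: => Map[D, D]) {
  var freezeCount = 0
  def freeze(): Map[D, D] = { freezeCount += 1; entries }
}

// Eager version: snapshots every scope up front, mirroring the current
// allScopes.freeze behavior.
class EagerProvider[D](scopes: Map[String, Scope[D]]) {
  private val frozen = scopes.map { case (kind, s) => kind -> s.freeze() }
  def succ(kind: String, d: D): Option[D] = frozen(kind).get(d)
}

// Lazy version: freezes a scope only on the first query for its kind.
class LazyProvider[D](scopes: Map[String, Scope[D]]) {
  private val frozen = mutable.Map.empty[String, Map[D, D]]
  def succ(kind: String, d: D): Option[D] =
    frozen.getOrElseUpdate(kind, scopes(kind).freeze()).get(d)
}
```

With many declaration kinds but queries touching only a few of them, the lazy provider performs correspondingly few freezes, which is where the claimed savings would come from.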

This seems to work: I ran it about 4 times and it is consistently around 45 seconds now.

In the above PR, strictly the transformation stage takes 28 seconds, of which 16 seconds are spent in CoercingRewriter.coerce(Expr) - mostly to type-check expressions.