Performance issues
sakehl opened this issue · 13 comments
So @pieter-bos already noticed that the C files generated by Halide can sometimes be quite slow to process. I think I've found the worst offender yet.
This file took 2030s to go through all passes, before being sent to Silicon. So about 33 minutes. (latest dev, commit: 003a9ed)
SubDirectionHalide-NCB-L-P.c.txt
I do not fully understand why it is slow, but I've uploaded the dump I got from IntelliJ (see how to open it here: https://www.jetbrains.com/help/idea/create-a-profiling-report.html#open).
https://drive.google.com/file/d/1Vaj8QOgd3CAqDsUS7aedCC25zpAMUbDU/view?usp=sharing
The first insight I got is that it may be related to 'freeze' in vct/col/util/Scopes.scala.
I actually noticed freeze being a bottleneck before, but it disappeared completely when I wasn't profiling. And getting rid of the ThreadLocal in the SuccessionMap made things a lot faster while profiling, but made no difference when not profiling.
Ah, that makes sense; I was already doubting myself about why it took 33 minutes.
So without profiling it takes: 1m24,166s
Still slow, though not as horrible as 33 minutes.
So with ThreadLocal 'disabled', I get this timing: 5m41,707s.
But maybe it's just unstable, sometimes faster and sometimes slower.
It mostly seems slow in the 'adt*' transformations: 'adtFrac', 'adtNothing', etc. are slow, for instance. So I'm pretty sure it's somewhere there.
With these passes disabled in Transformation.scala:
// ImportOption.withArg(adtImporter),
// ImportFrac.withArg(adtImporter),
// ImportNothing.withArg(adtImporter),
// ImportVoid.withArg(adtImporter),
// ImportNull.withArg(adtImporter),
// ImportAny.withArg(adtImporter),
it finishes all the transformations in 30s.
Hmm, new observation: when removing ThreadLocal both from vct/col/util/SuccessionMap.scala and hre/util/ScopedStack.scala, I get to a respectable 0m46,762s. And now all passes are enabled.
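For reference, a minimal sketch of what "removing the ThreadLocal" means in this experiment, assuming SuccessionMap stores its entries in a per-thread map (the real class in vct/col/util/SuccessionMap.scala may differ; names here are stand-ins):

```scala
import scala.collection.mutable

// Hypothetical simplification of SuccessionMap, for illustration only.
class SuccessionMapSketch[K, V] {
  // Before: one map per thread, so parallel rewrites cannot see each
  // other's successors.
  // private val storage: ThreadLocal[mutable.Map[K, V]] =
  //   ThreadLocal.withInitial(() => mutable.Map.empty[K, V])

  // After: a single shared map. Faster in these experiments, but only
  // safe while the rewriters run single-threaded.
  private val storage: mutable.Map[K, V] = mutable.Map.empty

  def update(k: K, v: V): Unit = storage(k) = v
  def get(k: K): Option[V] = storage.get(k)
}
```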
To clarify the earlier 5m41,707s timing with ThreadLocal 'disabled': there I only had it disabled in vct/col/util/SuccessionMap.scala.
Enabling the ThreadLocal again, I get a timing of 00:06:37, but now it got stuck in other passes (evaluationTarget, for instance).
So my conclusion for now is that with ThreadLocal the timing is hard to predict, and it is sometimes a lot slower (and sometimes not), whereas without ThreadLocal it is consistent in its speed and seems faster.
@pieter-bos are the ThreadLocals necessary? I don't think I fully understand what they do and why they would be needed.
Edit: Very inconsistent. With ThreadLocal enabled, my last run of the same file took over 10m57s (I interrupted execution after that).
Transformations are set up in such a way that they can be parallelized, so ScopedStack and SuccessionMap have a well-defined notion of what happens when they are used in a parallel context. I am wildly confused why ThreadLocal is so slow; the implementation reads to me like a tiny, nearly constant-time map, so I don't understand a 10x speedup. The upside here is that you can essentially make VerCors parallel by rewriting all the global declarations in parallel, and it pretty much just works. The only reason that is not in mainline VerCors is that I'm not certain we should want to support it.
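To illustrate the point about parallel contexts, here is a minimal sketch (not the actual hre/util/ScopedStack.scala) of how backing the stack with a ThreadLocal gives each worker thread its own scope stack:

```scala
// Sketch only: each thread pushing a scope sees its own stack, so
// rewriting global declarations on several threads cannot interfere.
final class ScopedStackSketch[T] {
  private val stack = new ThreadLocal[List[T]] {
    override def initialValue(): List[T] = Nil
  }

  // Run `body` with `x` on top of this thread's stack.
  def having[R](x: T)(body: => R): R = {
    stack.set(x :: stack.get())
    try body
    finally stack.set(stack.get().tail)
  }

  def topOption: Option[T] = stack.get().headOption
}
```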
Disabling the imports helping so much is unfortunately expected: there is a known performance issue with all CoercingRewriters. The root of the issue is that match expressions with a lot of case arms are still implemented as a linear search, instead of some kind of lookup table by type. In particular, we spend nearly all of our time checking the type of Expr in CoercingRewriter.coerce(Expr). I suspect that this would e.g. be resolved if we switch to Scala 3 and translate Node to an enum - I would appreciate it if someone investigates this.
I'll also have a look at the profile and see if there are any hints left :)
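For anyone picking this up, a tiny illustration (toy types, not the COL AST) of the linear-search behaviour described above: a pattern match over a large sealed hierarchy compiles to a chain of type tests, so the cost of coerce grows with the number of case arms.

```scala
sealed trait Node
case class Plus(l: Node, r: Node) extends Node
case class Minus(l: Node, r: Node) extends Node
case class Lit(v: Int) extends Node
// ...imagine hundreds more node kinds, as in the real Expr hierarchy.

def coerce(e: Node): Node =
  e match {
    case Plus(l, r)  => Plus(coerce(l), coerce(r))  // type test 1
    case Minus(l, r) => Minus(coerce(l), coerce(r)) // type test 2
    case Lit(v)      => Lit(v)                      // type test 3
    // With N arms, matching the last kind costs N instanceof checks on
    // every call; a lookup table keyed on the class (or a Scala 3 enum
    // ordinal) would make this dispatch constant time.
  }
```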
I think the profiler just does something which makes the ThreadLocal way more expensive, because indeed it shouldn't be that slow.
To be clear: even without the profiler, ThreadLocal seems to make my runs a lot slower and less predictable.
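One way to check this hypothesis outside the profiler is a crude timing loop comparing a ThreadLocal read with a plain field read; this is illustrative only (a real measurement should use JMH):

```scala
object ThreadLocalTiming {
  val tl = new ThreadLocal[Integer] {
    override def initialValue(): Integer = Integer.valueOf(1)
  }
  var plain: Int = 1

  // Crude wall-clock timer; prints the result to defeat dead-code
  // elimination.
  def bench(label: String)(body: => Long): Unit = {
    val t0 = System.nanoTime()
    val r = body
    println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.1f ms (result $r)")
  }

  def main(args: Array[String]): Unit = {
    val n = 100000000
    // First run warms up the JIT and is discarded.
    bench("warmup")          { var s = 0L; var i = 0; while (i < n) { s += tl.get().intValue; i += 1 }; s }
    bench("ThreadLocal.get") { var s = 0L; var i = 0; while (i < n) { s += tl.get().intValue; i += 1 }; s }
    bench("plain field")     { var s = 0L; var i = 0; while (i < n) { s += plain; i += 1 }; s }
  }
}
```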
This is my latest profiler information, with the ThreadLocal disabled.
Blowup_2024_06_25_113758.zip
$ git diff
diff --git a/src/col/vct/col/ast/type/typeclass/TypeImpl.scala b/src/col/vct/col/ast/type/typeclass/TypeImpl.scala
index b2bb1bc8a..be98cef9c 100644
--- a/src/col/vct/col/ast/type/typeclass/TypeImpl.scala
+++ b/src/col/vct/col/ast/type/typeclass/TypeImpl.scala
@@ -54,7 +54,7 @@ trait TypeImpl[G] extends TypeFamilyOps[G] {
def particularize(substitutions: Map[Variable[G], Type[G]]): Type[G] = {
case object Particularize extends NonLatchingRewriter[G, G] {
case object IdentitySuccessorsProvider
- extends SuccessorsProviderTrafo[G, G](allScopes.freeze) {
+ extends SuccessorsProviderTrafo[G, G](null) {
override def preTransform[I <: Declaration[G], O <: Declaration[G]](
pre: I
): Option[O] = Some(pre.asInstanceOf[O])
Could you try this? Most of the freeze calls seem to come from type-checking ADT invocations that need to monomorphize over the supplied type arguments, and that doesn't actually need the successors.
Otherwise it seems the design is a bit flawed. Every AbstractRewriter has a Scopes to account for scopes, and this is abstracted into a SuccessorsProvider to allow custom behaviour - but by default succProvider already calls freeze on allScopes, which considers all scopes for every reference. It should be understood that a SuccessorsProvider has not yet frozen the situation, and instead only does this after choosing a declaration kind - this should save us something like 95% of the remaining calls.
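A hedged sketch of that "freeze only after choosing a declaration kind" idea, with stand-in names since the real Scopes/SuccessorsProvider types are more involved:

```scala
import scala.collection.mutable

// Stand-in for one declaration kind's scope; freeze snapshots it into a
// pure lookup function from old declarations to their successors.
trait KindScope[I, O] { def freeze: I => O }

// Instead of snapshotting every scope up front (allScopes.freeze), only
// freeze the scope of a kind the first time a successor of that kind is
// requested, and memoize the snapshot per kind.
final class LazySuccessorsProvider[I, O](scopes: Map[String, KindScope[I, O]]) {
  private val frozen = mutable.Map.empty[String, I => O]

  def succ(kind: String, pre: I): O =
    frozen.getOrElseUpdate(kind, scopes(kind).freeze)(pre)
}
```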
This seems to work; I ran it about 4 times and it is consistently around 45 seconds now.
In the above PR, strictly the transformation stage takes 28 seconds, of which 16 seconds is spent in CoercingRewriter.coerce(Expr), mostly to type-check expressions.