softdevteam/yksom

Are We Fast Yet Benchmarks

smarr opened this issue · 20 comments

smarr commented

I finally setup yksom to run the Are We Fast Yet benchmarks: http://github.com/smarr/are-we-fast-yet

CD and Havlak are failing.

Havlak seems not unusual:

yksom  --cp .:Core:CD:DeltaBlue:Havlak:Json:NBody:Richards:~/.local/yksom/SOM/Smalltalk Harness.som  Havlak 1  1500

Starting Havlak benchmark ...
thread 'main' panicked at 'Not enough stack space to execute block.', src/lib/vm/core.rs:404:13

So, is there a way to easily increase the available stack?

CD will require some debugging.
It runs into a division by zero, which likely means that there's some arithmetic operation that's not quite right:

yksom  --cp .:Core:CD:DeltaBlue:Havlak:Json:NBody:Richards:~/.local/yksom/SOM/Smalltalk Harness.som  CD 1  250
Starting CD benchmark ...
Traceback (most recent call at bottom):
  File /home/gitlab-runner/.local/yksom/SOM/Smalltalk/System.som, line 46, column 10:
        ^ (application respondsTo: #run:)
              ifTrue:  [ application run: arguments ]
              ifFalse: [ application run ]
  File /home/gitlab-runner/.local/yksom/SOM/Smalltalk/Boolean.som, line 30, column 8:
        self ifTrue:  [ ^trueBlock value  ].
  File /home/gitlab-runner/.local/yksom/SOM/Smalltalk/True.som, line 32, column 24:
    ifTrue:  block = ( ^block value )
  File /home/gitlab-runner/.local/yksom/SOM/Smalltalk/Boolean.som, line 30, column 25:
        self ifTrue:  [ ^trueBlock value  ].
  File /home/gitlab-runner/.local/yksom/SOM/Smalltalk/System.som, line 47, column 23:
            ifTrue:  [ application run: arguments ]
  File /data/home/gitlab-runner/builds/d258e35c/0/sm951/awfy-runs/awfy/benchmarks/SOM/Harness.som, line 46, column 4:
    run runBenchmark.
  File /data/home/gitlab-runner/builds/d258e35c/0/sm951/awfy-runs/awfy/benchmarks/SOM/Run.som, line 50, column 4:
    self doRuns: benchmarkSuite new.
  File /data/home/gitlab-runner/builds/d258e35c/0/sm951/awfy-runs/awfy/benchmarks/SOM/Run.som, line 70, column 4:
    1 to: numIterations do: [:i |
        self measure: bench
      ]
  File /home/gitlab-runner/.local/yksom/SOM/Smalltalk/Integer.som, line 80, column 8:
        self to: limit by: 1 do: block
  File /home/gitlab-runner/.local/yksom/SOM/Smalltalk/Integer.som, line 86, column 8:
        [ i <= limit ] whileTrue: [ block value: i. i := i + step ]
  File /home/gitlab-runner/.local/yksom/SOM/Smalltalk/Block.som, line 40, column 8:
        block value.
  File /home/gitlab-runner/.local/yksom/SOM/Smalltalk/Integer.som, line 86, column 36:
        [ i <= limit ] whileTrue: [ block value: i. i := i + step ]
  File /data/home/gitlab-runner/builds/d258e35c/0/sm951/awfy-runs/awfy/benchmarks/SOM/Run.som, line 71, column 6:
      self measure: bench
  File /data/home/gitlab-runner/builds/d258e35c/0/sm951/awfy-runs/awfy/benchmarks/SOM/Run.som, line 59, column 5:
    (bench innerBenchmarkLoop: innerIterations) ifFalse: [
  File CD/CD.som, line 50, column 20:
    ^ self verify: (self benchmark: innerIterations) resultFor: innerIterations
  File CD/CD.som, line 40, column 4:
    0 to: numFrames - 1 do: [:i |
        | time collisions |
        time := i // 10.0.
        collisions := detector handleNewFrame: (simulator simulate: time).
        actualCollisions := actualCollisions + collisions size ].
  File /home/gitlab-runner/.local/yksom/SOM/Smalltalk/Integer.som, line 80, column 8:
        self to: limit by: 1 do: block
  File /home/gitlab-runner/.local/yksom/SOM/Smalltalk/Integer.som, line 86, column 8:
        [ i <= limit ] whileTrue: [ block value: i. i := i + step ]
  File /home/gitlab-runner/.local/yksom/SOM/Smalltalk/Block.som, line 40, column 8:
        block value.
  File /home/gitlab-runner/.local/yksom/SOM/Smalltalk/Integer.som, line 86, column 36:
        [ i <= limit ] whileTrue: [ block value: i. i := i + step ]
  File CD/CD.som, line 43, column 20:
      collisions := detector handleNewFrame: (simulator simulate: time).
  File CD/CollisionDetector.som, line 60, column 18:
    allReduced := self reduceCollisionSet: motions.
  File CD/CollisionDetector.som, line 160, column 4:
    motions forEach: [:motion | self draw: motion on: voxelMap ].
  File Core/Vector.som, line 69, column 4:
    first to: last - 1 do: [ :i | block value: (storage at: i) ]
  File /home/gitlab-runner/.local/yksom/SOM/Smalltalk/Integer.som, line 80, column 8:
        self to: limit by: 1 do: block
  File /home/gitlab-runner/.local/yksom/SOM/Smalltalk/Integer.som, line 86, column 8:
        [ i <= limit ] whileTrue: [ block value: i. i := i + step ]
  File /home/gitlab-runner/.local/yksom/SOM/Smalltalk/Block.som, line 40, column 8:
        block value.
  File /home/gitlab-runner/.local/yksom/SOM/Smalltalk/Integer.som, line 86, column 36:
        [ i <= limit ] whileTrue: [ block value: i. i := i + step ]
  File Core/Vector.som, line 69, column 34:
    first to: last - 1 do: [ :i | block value: (storage at: i) ]
  File CD/CollisionDetector.som, line 160, column 32:
    motions forEach: [:motion | self draw: motion on: voxelMap ].
  File CD/CollisionDetector.som, line 185, column 4:
    self recurse: voxelMap seen: seen voxel: (self voxelHash: motion posOne) motion: motion
  File CD/CollisionDetector.som, line 142, column 5:
    (self isInVoxel: nextVoxel motion: motion) ifFalse: [ ^ self ].
  File CD/CollisionDetector.som, line 97, column 14:
    low_x  := (v_x - r - x0) // xv.
Division by zero.
smarr commented

The overall performance, comparing the interpreter-only performance of SOM implementations looks like this:

yksom

ykSOM isn't bad :)

Which of the SOMs do the "treat ifTrue specially" thing? From memory, CSom doesn't, but SOM++ does. yksom doesn't. I don't know about the others (and my memory might be wrong anyway!).

smarr commented

PySOM and TruffleSOM do various optimizations, including inlining.
From the top of my head:

The AST interpreters are also more aggressively inlining loop constructs like #do: and #to:do:, which doesn't work well on bytecodes, because the JIT compilers lose live time info for loop indexes, which is costly in compiled code.

SOM++ does only a broken inlining of ifTrue/ifFalse.

CSOM doesn't do anything. And SOM-RS doesn't really optimize either.

When you say "inline", my memory of SOM++ is that it hardcodes ifTrue:s functionality, so changing it in a .som file doesn't actually change program behaviour. Do PySOM and TruffleSOM do that, or do they inline the SOM code itself?

Basically, I'm trying to work out what the comparison is :) From memory, the SOM++ ifTrue: cheat (which, since it changes the language's semantics I do think of as a "cheat" rather than a transparent optimisation) was a substantial performance increase. I might be wrong about that though!

smarr commented

PySOM and TruffleSOM implement inlining of the blocks associated with these operations.
In the bytecode interpreters, I use "standard Smalltalk cheating" and map the various things to basic jump bytecodes.

  public static final byte JUMP                  = 44;
  public static final byte JUMP_ON_TRUE_TOP_NIL  = 45;
  public static final byte JUMP_ON_FALSE_TOP_NIL = 46;
  public static final byte JUMP_ON_TRUE_POP      = 47;
  public static final byte JUMP_ON_FALSE_POP     = 48;
  public static final byte JUMP_BACKWARDS        = 49;

This means, if you change True.som and False.som to do extra things in the ifTrue: implementation for instance, that extra thing is just ignored. There's also no check that the core-lib is unchanged. I could and probably should do that, but haven't.

Have a look at

I tried to be good and document some of these bits:

Unfortunately, I didn't link the ReBench reports back then.
Though, it's a mixed bag when considering JIT compilation.

Can I suggest that none of the SOMs should do the "standard Smalltalk cheating" and all SOMs should do only transparent optimisations? I think that it's very hard (and, arguably, unfair) to compare the performance VMs doing transparent optimisations (including, by the sounds of it, PySOM and TruffleSOM!) vs. those that are not (e.g. SOM++).

smarr commented

Well... I am convinced enough that all these optimizations can be made fully transparent with only minimal startup cost.

But, I haven't put in the engineering to actually do it...

So... yeah.

Though, I am not sure what's unfair about comparing VMs that do and that don't do optimizations. How else would you know which optimizations are beneficial?

When I say "transparent" optimisation I mean "any change the user makes to any .som file works as they expect". A "non-transparent" optimisation (e.g. SOM++'s ifTrue: thing) violates that principle: I can change an ifTrue: method and it won't have the effect I expect when I run it in SOM++.

FWIW, I think a fully transparent ifTrue: optimisation in SOM++ (and yksom) is non-trivial, and has run-time costs. That's why I think it really muddles the comparison.

smarr commented

To make it transparent I was thinking of possibly doing one of the following:

  • checksum the core-lib
  • check token stream of method for equivalence
  • or check AST or bytecode of method for equivalence

If the check fails, disable, and possibly warn the user.
This has only startup cost, and a somewhat brittle performance profile.
Though, most languages have certain dependencies between the core library and VM, so, I wouldn't consider this approach unusual.

I wouldn't go so far as to make it fully generic...

I think that even with these rules, users can't override ifTrue: in subclasses?

smarr commented

Override is just a cheap guard and a resend away.
The way the parsing+inlining works in TruffleSOM and PySOM means that the block is right there and can be passed on to an overridden method.

I have to admit it's not implemented though...

(Strictly speaking that's not entirely true. It's implemented for #do: in the AST interpreter because that's used on Vector and Array, and the resend is used in most primitives).

On class loading, one could even track whether there is any override in the first place.

And only if that's the case, use the variant that needs to check.
And then, as soon as the result is not an actual boolean (for the #ifTrue:/#ifFalse: case, just bail, and resend as a message. The logic is currently used for all primitive operator optimizations, things like + and what not.

Most of the pieces are there.
That's kind of why I never really bothered.
I convinced myself that I think that I know how it can work... (I know, dangerous)
And that it's not particularly exciting to actually make it work...

I fear it will take an external stimulus, like someone one the internet just being annoying enough to make me feel like I have to proof that it works 😉

Annoying? Moi? I don't mind what we do, so long as it's the same for all SOMs, and I know I'm comparing apples to apples. Since I'm stupid and lazy, I'd rather not implement ifTrue: complexity, but if all the other SOMs do so, then yksom will do so too.

smarr commented

I am not sure what exactly you imagine with apples to apples.

You did your own design without really looking at the existing implementations, I believe.
So, it's not apples to apples in the sense of trying to have mostly similar interpreters just in different implementation languages, right?

As far as I can tell, your design differs in important aspects.
I think you have a single continuous stack, instead of frame objects, and your bytecodes are not identical to the SOM ones, not even the basic ones.
They are not too far off, but not the same.

I am not sure what other differences there are, but apples to apples seems more likely between TruffleSOM and PySOM, which have relatively few differences, mostly motivated by the underlying implementation platform.

There are, of course, always many types of apples, but I think that we can probably agree that having different implementations literally executing different user code is not what we want :)

Put another way: if we say that all implementations must do the SOM++ ifTrue: style execution, I can live with that (even though, I admit, my prejudices dislike it), provided all SOMs do that. At the moment, comparing say SOM++ and yksom is too fraught, because the ifTrue: difference probably dominates performance differences.

smarr commented

Hm. I don't know.

I think, for the comparison you have in mind, it would be better to simply rip out the optimizations in question and then run the comparison that you want to do, without "mandating" what a SOM should be doing.

I'd think, a SOM should only be mandated to "adhere to a specification" (which isn't existing, and very badly approximated by the unit tests) similar to ECMAScript.
Anything underneath the surface is fair game.

After all, different SOMs have different reasons for existing. Some exist just for the joy of it, like SOM-RS or JsSOM...

I think, for the comparison you have in mind, it would be better to simply rip out the optimizations in question and then run the comparison that you want to do, without "mandating" what a SOM should be doing.

OK.

I also realised (sorry it's been a busy day!) that I missed some bits:

So, is there a way to easily increase the available stack?

I guess this is your Unix stack limit? ulimit -s should fix that?

It runs into a division by zero, which likely means that there's some arithmetic operation that's not quite right:

Eek. Dunno! If you have any suggestions, I'm all ears!

smarr commented

So, is there a way to easily increase the available stack?

I guess this is your Unix stack limit? ulimit -s should fix that?

Ehm, no. This is your SOM stack, which is about to overflow here:

https://github.com/softdevteam/yksom/blob/master/src/lib/vm/core.rs#L403-L405

If I glance at the code correctly, I'd need to increase this SOM_STACK_LEN here https://github.com/softdevteam/yksom/blob/master/src/lib/vm/somstack.rs#L7

Eek. Dunno! If you have any suggestions, I'm all ears!

Yeah. This is more hairy to debug.

I'd probably start with printing out computational results in the benchmark code and diff the log between ykSOM and SOM-java perhaps to see where they start to diverge to narrow down what the cause could be.

If I glance at the code correctly, I'd need to increase this SOM_STACK_LEN here https://github.com/softdevteam/yksom/blob/master/src/lib/vm/somstack.rs#L7

I'd forgotten about that! I wonder what a safe limit might be? I'd be fine with doubling or quadrupling it if you think that's necessary.

smarr commented
# HACK: Havlak needs a lot of stack to run reliably...
if 'Havlak' in args.args:
  # double the standard stack size
  # we don't do it always, because...
  JAVA_ARGS += ['-Xss3072k']

I use this in TruffleSOM... Not, well, ideal... No idea how this would translate to your case. I'd need to test. But not today.
Apparently I am writing a proposal tonight...

Oh, interesting, you have to increase the Java stack size! At some point you could definitely also explode the Rust stack too, but my experience is that most shells set that to a fairly high value, so... hopefully not!