AnyDSL/MimIR

optimization exhibits non-deterministic behavior


Sometimes, the behavior of the optimization pipeline seems to be non-deterministic.

Example:
./build/bin/thorin -d mem -o - lit/mem/no_mem.thorin -VVVV
in https://github.com/NeuralCoder3/thorin2/tree/ad_ptr_merge (commit 702d848)

The issue might be due to the add_mem optimization, the pipeline builder, or an underlying bug in thorin.

This behavior might also be a side effect of the earlier (not yet merged) changes to mem and clos conv, which have a far-reaching impact that may simply not have manifested until now.

Yes, this is super annoying. Another source is this:

world.app(emit1(), emit2());

In C++, the order in which function arguments are evaluated is unspecified, so emit1() may run before or after emit2(). This code can therefore behave differently on different compilers and operating systems.
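A minimal sketch of the difference, assuming a toy world that hands out increasing gids. `ToyWorld`, `emit`, `build_unsequenced`, and `build_sequenced` are hypothetical stand-ins for illustration, not the actual thorin API. Binding each result to a named local puts the two calls into separate statements, so their order is fixed:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a world that numbers nodes in creation order.
struct ToyWorld {
    int next_gid = 0;
    std::vector<std::string> log; // names in the order they were created

    int emit(const std::string& name) {
        log.push_back(name);
        return next_gid++;
    }

    // Hypothetical analogue of world.app(callee, arg): just pairs the two ids.
    std::pair<int, int> app(int callee, int arg) {
        assert(callee < next_gid && arg < next_gid);
        return {callee, arg};
    }
};

// Both emit calls are function arguments: the compiler may run either one first,
// so the gids (and the log order) can differ between compilers.
std::pair<int, int> build_unsequenced(ToyWorld& w) {
    return w.app(w.emit("callee"), w.emit("arg"));
}

// Each emit call is its own statement: "callee" always gets the smaller gid.
std::pair<int, int> build_sequenced(ToyWorld& w) {
    int callee = w.emit("callee");
    int arg    = w.emit("arg");
    return w.app(callee, arg);
}
```

With `build_sequenced`, the creation log is always `callee` followed by `arg`; with `build_unsequenced`, either order is a valid outcome.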

I have implemented the --trace-gids switch that we could somehow use to test for this in our CI.

The issue happens only sometimes, with the same executable on the same computer under the same circumstances.
Therefore, timing issues or some other source of randomness might be the cause.

Probably related issue:
./build/bin/thorin -d matrix -d affine lit/matrix/mapReduce_mult.thorin -o - -VVVV
in matrix_dialect (f3a3def) sometimes generates thorin code and sometimes prints the following error:

:4294967295: error: cannot pass argument 
  '(__806508#2:(.Idx 3), ‹__806508#2:(.Idx 3); .Idx 4294967296›, 0)' of type 
  '[.Nat, «__806508#2:(.Idx 3); ★», .Nat]' to 
  '%mem.lea' of domain 
  '[n_834521: .Nat, _834535: «n_836768; ★», _834540: .Nat]'

This seems odd to me, as the argument has the form

(n, <n; T>, 0)

which should have the type

[n: .Nat, <<n; *>>, .Nat]

and that should agree with the domain of lea.

I was fighting this issue in #184, as a Debug build produced different output than the Release build.

  • 05e833b
    A few asserts created new Defs, resulting in slightly different behavior between Debug and Release builds. This commit fixes the issue.
  • 2997a1d
    This one fixes a subtle problem that occurs when a Def coincidentally has the same name as an external Def.
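The Debug/Release difference from the first commit can be sketched as follows. `GidCounter` and both functions are hypothetical and only model the effect; they are not taken from the thorin sources. An expression inside `assert` is compiled out when `NDEBUG` is defined, so any gid it allocates exists only in Debug builds:

```cpp
#include <cassert>

// Hypothetical stand-in for anything that allocates a fresh gid per new Def.
struct GidCounter {
    int next_gid = 0;
    int fresh() { return next_gid++; }
};

// Problematic: fresh() runs only when asserts are enabled, so the gid
// returned afterwards is 1 in a Debug build and 0 under NDEBUG.
int gid_after_bad_check(GidCounter& c) {
    assert(c.fresh() >= 0);
    return c.fresh();
}

// One possible remedy: perform the side effect unconditionally and assert
// only on its result, so both build types allocate the same gids.
int gid_after_good_check(GidCounter& c) {
    int probe = c.fresh();
    assert(probe >= 0);
    (void)probe; // silence the unused-variable warning under NDEBUG
    return c.fresh();
}
```

Another option is to keep the asserted expression free of side effects entirely, which avoids the extra allocation in both build types.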

As mentioned above, --trace-gids and --reeval-breakpoints helped me track down the problem. We could probably write a test case with some non-trivial code, run it with --trace-gids, and double-check in our CI that all builds produce the same output.
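The comparison step of such a CI check could look roughly like this. It is only a sketch: `first_divergence` is a hypothetical helper, and it treats the captured output of each build as opaque text without assuming anything about what --trace-gids actually prints.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Split captured output into lines.
static std::vector<std::string> split_lines(const std::string& text) {
    std::vector<std::string> lines;
    std::istringstream in(text);
    for (std::string line; std::getline(in, line);) lines.push_back(line);
    return lines;
}

// Returns the 1-based number of the first line where the two outputs differ,
// or std::nullopt if they are identical line by line.
std::optional<std::size_t> first_divergence(const std::string& a, const std::string& b) {
    const auto la = split_lines(a);
    const auto lb = split_lines(b);
    const std::size_t common = std::min(la.size(), lb.size());
    for (std::size_t i = 0; i < common; ++i)
        if (la[i] != lb[i]) return i + 1;
    if (la.size() != lb.size()) return common + 1; // one output is a strict prefix
    return std::nullopt;
}
```

A CI job would capture the output of the same test case from each build (for example Debug and Release, or two runs of the same binary) and fail if `first_divergence` returns a line number, which also points at where the runs start to disagree.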

While #185 fixes part of this problem, there are still some odd things happening and we need a test case to test for this.