Minres/CoreDSL

Flesh out semantics for spawns with atomic updates

jopperm opened this issue · 8 comments

@eyck, @AtomCrafty, @wysiwyng, @7FM: Please have a look and let me know what you think!

Regarding writes to non-volatile registers in a spawn block: what happens when you end up with multiple writes, possibly to the same register? Does DDR allow dependent instructions to execute only after the last write? What about loops?

One possible solution could be to "release" all the registers written (i.e. allow dependent instructions to execute) after the spawn block terminates. But that would obstruct a few use-cases, I presume?
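
For concreteness, I'm thinking of something like this (entirely made-up) fragment, where the same destination register is written twice within one spawn block:

DOUBLE_WRITE {
  encoding: 20'd0 :: rd[4:0] :: 7'b0001011;
  behavior: {
    spawn {
      X[rd] = 1;  // intermediate value
      X[rd] = 2;  // last write -- presumably the one that releases dependent instructions?
    }
  }
}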


The section regarding accesses to volatile registers could do with a (non-normative) rationale about why volatile accesses are only allowed in a when condition/body.

7FM commented

If an instruction writes to a non-volatile register within its spawn-statement, then the same instruction will be implicitly blocked from executing again (with the same arguments) until the writes are complete, right?
I.e.

DELAYED_ZERO_INIT {
  encoding: 20'd0 :: rd[4:0] :: 7'b0010111;
  behavior: {
    spawn {
      unsigned<5> delay = rd;
      while (delay > 0) {
        delay -= 1;
      }
      X[rd] = 0;
    }
  }
}

and an instruction sequence:

DELAYED_ZERO_INIT rd=$x3
// <<implicit wait until spawn block execution finished?>>
DELAYED_ZERO_INIT rd=$x3

Here is an interesting edge case:

DELAYED_ZERO_INIT rd=$x3
// <<no blocking?>>
// Spawn Blocks: delay1=3
DELAYED_ZERO_INIT rd=$x2
// <<no blocking?>>
// Spawn Blocks: delay1=2 delay2=2
DELAYED_ZERO_INIT rd=$x1
// Spawn Blocks: delay1=1 delay2=1 delay3=1
// Spawn Blocks: delay1=0 delay2=0 delay3=0
// Spawn Blocks: X[3] = 0;  X[2] = 0;  X[1] = 0;

Presuming that each instruction requires a single cycle to execute and isn't waiting for prior invocations of the spawn-block, all three spawn-blocks would need to run in parallel. Yet, the same hardware certainly cannot be used for arbitrarily many invocations simultaneously.
Similar cases can probably be constructed when using volatile registers or non-volatile address spaces where no DDR is utilized.
The only solutions that come to my mind are:

  • wait until a prior spawn-block invocation has completed (X[3], X[2] & X[1] would be set to zero, at the cost of delaying the execution of the DELAYED_ZERO_INIT instructions)
  • ignore new spawn-block invocations while an older one has not yet completed (only X[3] would be set to zero)

In either case, I think it is worth mentioning here.

Regarding writes to non-volatile registers in a spawn block: what happens when you end up with multiple writes, possibly to the same register? Does DDR allow dependent instructions to execute only after the last write?

Yes, that is my mental model. In the HLS context, I wouldn't perform intermediate register updates that are later overwritten, and only actually do the last one.

What about loops?

Same reasoning. In the draft, I propose to constrain the indices to expressions of instruction fields and literals, so you couldn't write a loop that accesses a register depending on the loop counter (or any other form of data-dependent indexing).
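
To sketch the intended restriction (the fragments below are illustrative only, not taken from the draft): indexing with an instruction field is fine, indexing with a loop variable is not.

spawn {
  X[rd] = 0;          // OK: index is an instruction field
}

spawn {
  unsigned<5> i = 0;
  while (i < 4) {
    X[rd + i] = 0;    // not allowed: index depends on the loop variable
    i += 1;
  }
}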

One possible solution could be to "release" all the registers written (i.e. allow dependent instructions to execute) after the spawn block terminates. But that would obstruct a few use-cases, I presume?

Not necessarily. It would be a bit more conservative (but also much simpler to implement) than keeping track of written registers, but nonetheless a valid implementation of the proposed semantics. I think the common case for these kinds of spawns is to wrap long-running operations that produce one (or a small number of) value(s) at the end. Do you have a particular ISAX in mind that would benefit from passing intermediate values out to consumer instructions?
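
The typical shape I have in mind is something like the following sketch (name, encoding and the elided computation are placeholders):

SLOW_OP {
  encoding: 12'd0 :: rs1[4:0] :: 3'b001 :: rd[4:0] :: 7'b0001011;
  behavior: {
    spawn {
      // ... long-running computation on X[rs1] ...
      X[rd] = X[rs1];  // a single result is published at the very end
    }
  }
}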

The section regarding accesses to volatile registers could do with a (non-normative) rationale about why volatile accesses are only allowed in a when condition/body.

Let me try here first. As we don't have a notion of time (cycles, pipeline stages, instructions, ...) at the CoreDSL level, and there is no DDR for volatile accesses, we cannot give any guarantees about when an access will happen. For example, updating the PC at an unspecified future time is not a useful action. By tying updates to volatile state to changes of that state, we have a means of "timing" the action without actually modeling time.

If an instruction writes to a non-volatile register within its spawn-statement, then the same instruction will be implicitly blocked from executing again (with the same arguments) until the writes are complete, right?

Currently, no, but this is a very good point, I completely forgot about WAW dependencies.

Here is an interesting edge case:
[...]
Presuming that each instruction requires a single cycle to execute and isn't waiting for prior invocations of the spawn-block, all three spawn-blocks would need to run in parallel. Yet, the same hardware certainly cannot be used for arbitrarily many invocations simultaneously.

This is certainly true, but at the language level, I hope that we don't need to constrain it. Even though the spawn statement in your example contains a loop, this doesn't say anything about how long the spawn needs to execute. My overall aim is to define semantics that can be implemented in the ISS context as well as in hardware (albeit differently). An ISS can fully execute the spawn statement before advancing to the next instruction. HLS will need to set up additional logic to arbitrate access to the (potentially pipelined, or replicated) spawn module, and in the worst case, stall the main pipeline until a new invocation of the spawn statement can be launched on the limited hardware. But in the end, the observable behavior shall be the same.

Similar cases can probably be constructed when using volatile registers or non-volatile address spaces where no DDR is utilized. The only solutions that come to my mind are:

  • wait until a prior spawn-block invocation has completed (X[3], X[2] & X[1] would be set to zero, at the cost of delaying the execution of the DELAYED_ZERO_INIT instructions)

👍

  • ignore new spawn-block invocations while an older one has not yet completed (only X[3] would be set to zero)

I think that would be quite inconvenient for assembler writers and compiler engineers to reason about.

In either case, I think it is worth mentioning here.

👍

What about loops?

Same reasoning. In the draft, I propose to constrain the indices to expressions of instruction fields and literals, so you couldn't write a loop that accesses a register depending on the loop counter (or any other form of data-dependent indexing).

Yes, I was thinking about a loop updating the same register repeatedly.
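
E.g. something like this (sketched) fragment, where only the value written in the last iteration should become visible to dependent instructions:

spawn {
  unsigned<3> i = 0;
  while (i < 4) {
    X[rd] = i;   // overwritten in every iteration
    i += 1;
  }
}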

One possible solution could be to "release" all the registers written (i.e. allow dependent instructions to execute) after the spawn block terminates. But that would obstruct a few use-cases, I presume?

Not necessarily. It would be a bit more conservative (but also much simpler to implement) than keeping track of written registers, but nonetheless a valid implementation of the proposed semantics. I think the common case for these kinds of spawns is to wrap long-running operations that produce one (or a small number of) value(s) at the end. Do you have a particular ISAX in mind that would benefit from passing intermediate values out to consumer instructions?

Zero-overhead loops came to mind, but the associated instruction(s) affect the PC, which is volatile. I just didn't want to exclude any valid use-case, but thinking about it again, "releasing" a register before the last write to it would essentially mean publishing intermediate results. And for that, you'd use volatile registers. So the existing semantics (a dependent instruction executes only after the last write to the register it depends on) seem perfectly sensible.

I added a rationale for the atomic updates (diff), and mentioned WAW dependencies, which shall also be obeyed (diff).

Shelving this proposal for the time being.