cucapra/dahlia

Type checking cannot reason about unrolled loops after desugaring

rachitnigam opened this issue · 6 comments

For the following example:

for (i ... ) unroll 2 {
  for (j ...) {
    let x = V[j];
  }
}

The type checker marks this program as correct because it expects the two parallel reads to V[j] to happen at the same time. Notionally, the following desugared program should also be okay (although it will actually generate an error, because the bindings for j might have changed):

for (i ... ) {
  for (j ...) {
    let x0 = V[j];
  }
  for (j ...) {
    let x1 = V[j];
  }
}

The reason the type checker marks the first program correct but rejects the second is that, in the second, it has no idea that the two parallel executions of V[j] actually refer to the same memory position. The unroll provides additional structure that lets the type checker "know" that this will actually happen.

If you're tempted to re-order the loops and hoist the memory access out to solve this problem, here is an example where you can't do that:

decl R: float[6 bank 3];
decl V: float[8 bank 4];
for (let i: bit<3> = 0..6) unroll 3 {
  for (let j: bit<4> = 0..8) unroll 4 {
    let rs = V[j];
  } combine {
    R[i] += rs;
  }
}

Upshot: The core calculus we presented in the paper is less expressive than the type checker we implemented. unroll is fundamentally more expressive than its desugaring because the equivalent unrolled program loses the extra timing information.

The more general principle here is that given a program:

{A --- B};
{C --- D}

the type checker has to be conservative and reject conflicts between B and C (say, both reading v[0]) because it can't be sure when C runs relative to A and B. With code generated from unroll, however, the timing is exact: C runs in lockstep with A, so B and C never overlap.
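
To make that concrete, here is a toy model in Scala. This is not Dahlia's actual checker: the Step and Group types, the function names, and the idea of reducing each time step to the set of memory locations it touches are all made up for illustration.

// A group like {A --- B} is a list of time steps; each step records
// the memory locations it touches.
type Step = Set[String]
type Group = List[Step]

// Conservative rule: with no timing information, any step of the second
// group might coincide with any step of the first, so every pairwise
// overlap has to be flagged.
def conservativeConflicts(g1: Group, g2: Group): Set[String] =
  (for { s1 <- g1; s2 <- g2 } yield s1 intersect s2).flatten.toSet

// Lockstep rule: step i of the second group runs exactly alongside
// step i of the first, so only the aligned pairs matter.
def lockstepConflicts(g1: Group, g2: Group): Set[String] =
  g1.zip(g2).flatMap { case (s1, s2) => s1 intersect s2 }.toSet

// {A --- B} ; {C --- D}, where B and C both touch v[1]:
val g1: Group = List(Set("v[0]"), Set("v[1]"))  // A, then B
val g2: Group = List(Set("v[1]"), Set("v[2]"))  // C, then D

val rejected = conservativeConflicts(g1, g2)  // Set("v[1]"): must reject
val accepted = lockstepConflicts(g1, g2)      // Set(): B and C never coincide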

Yes, this is true. In general, the type system is more powerful (i.e., can prove more programs correct) in the sugared form than in the desugared form. We have discussed a few times how to get a theorem that says "sugared program type-checks => desugared program type-checks," but the conclusion each time has been that the lower-level type system would need to be very complex to capture this kind of safety; it turns out to be much easier to enforce some properties on the more constrained programs that appear at the higher, sugared level.

This lack of a "safety implication theorem" is very much true of the Filament we have in the paper, for even simpler reasons: it has no banking, and it only has while loops! And we can't prove anything fancy about those. Designing such a complex, low-level, free-form type system is something I see as a goal of the "new Filament" idea.


Regarding this specific example, FWIW, I think it's easiest to explain by analogy to unrolling with ---, which we tread carefully around in the paper. Basically, we say that:

for (…) unroll 2 { A --- B }

Does not become:

for (…) { (A1 --- B1) ; (A2 --- B2) }

But rather:

for (…) { (A1 ; A2) --- (B1 ; B2) }

Intuitively, the steps inside the unrolled loop run in lockstep. A loop is "just" a series of time steps (the only difference is that it reuses the same hardware resources instead of requiring copies of them), so an inner loop needs the same treatment as an inner --- when the surrounding loop is unrolled. In other words, this:

for (…) unroll 2 {
  for (…) { A }
}

Should not become:

for (…) unroll 2 {
  for (…) { A1 } ; for (…) { A2 }
}

Instead, it should become:

for (…) {
   for (…) { A1 ; A2 }
}

Except that doesn't quite work. Desugaring turns this:

for (...) unroll 2 {
  for (...) {
    let x = v[0];
    ---
    let y = v[1];
  }
}

into:

for (...) {
  for (...) {
    let x = v[0];
    ---
    let y = v[1];
  }
  for (...) {
    let x = v[0];
    ---
    let y = v[1];
  }
}

and then, merging the two loops, into:

for (...) {
  for (...) {
    { let x = v[0]; --- let y = v[1]; }
    { let x = v[0]; --- let y = v[1]; } // <-- error! parallel reads to v[0]
  }
}

The solution in this case is also to de-nest the inner ---. We can keep doing this repeatedly, but it's just a hack: it tries to convey the timing property to the type checker by mangling the program. I don't have a problem with that, but I want to figure out the principle behind when such mangling is needed.

It's true that it's not very compositional, but the thing you'd want to do is use our rule above for "lockstep" unrolling of loops that contain --- when pushing the parallelization into the inner loop, to obtain:

for (…) {
  for (…) {
    let x0 = v[0]; let x1 = v[0];
    ---
    let y0 = v[1]; let y1 = v[1];
  }
}

Put slightly differently, you can imagine a helper function called lockstep-par. Applying lockstep-par to two simple statements A and B produces A ; B. Applying it to two sequences A1 --- A2 and B1 --- B2 produces {A1 ; B1} --- {A2 ; B2}, except that instead of ; it actually recursively calls lockstep-par on the components. It also works for loops (and recursively parallelizes the body).
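
As a sketch of that recursion, here is what lockstep-par might look like in Scala over a made-up AST. The Cmd type and its constructors are illustrative only (the real compiler's representation is different), and this assumes the two loops being merged have identical ranges, as unroll-generated copies do.

sealed trait Cmd
case class Stmt(text: String) extends Cmd             // an atomic statement
case class Par(c1: Cmd, c2: Cmd) extends Cmd          // c1 ; c2
case class TimeStep(c1: Cmd, c2: Cmd) extends Cmd     // c1 --- c2
case class For(range: String, body: Cmd) extends Cmd  // for (range) { body }

def lockstepPar(a: Cmd, b: Cmd): Cmd = (a, b) match {
  // two sequences of time steps are combined step-by-step
  case (TimeStep(a1, a2), TimeStep(b1, b2)) =>
    TimeStep(lockstepPar(a1, b1), lockstepPar(a2, b2))
  // two loops merge into one loop with a parallelized body
  case (For(r, bodyA), For(_, bodyB)) => For(r, lockstepPar(bodyA, bodyB))
  // base case (simple statements): plain parallel composition, A ; B
  case _ => Par(a, b)
}

// Merging the two copies of the inner loop from the example above:
val copy1 = For("j", TimeStep(Stmt("let x0 = v[0];"), Stmt("let y0 = v[1];")))
val copy2 = For("j", TimeStep(Stmt("let x1 = v[0];"), Stmt("let y1 = v[1];")))
val merged = lockstepPar(copy1, copy2)
// => For("j", TimeStep(Par(let x0 ..., let x1 ...), Par(let y0 ..., let y1 ...)))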

I know how to fix it, but I remain annoyed that it is not compositional.

Also, the same problem will happen when we generate if, while, or any other scope-based construct, because capabilities don't live through scopes.

See the generalized implementation for this here: