cornell-zhang/heterocl

reuse_at failed with SegFault after adding a void wrapper stage

I am in the process of rewriting the .to() API. To avoid any possible conflicts with the memory customizations, the new version of .to() does most of its work (e.g. generating new buffers, modifying the function body) after the reuse buffer generation IR pass has finished.

The only thing .to() does before reuse buffer generation is to create a void wrapper stage. However, this void wrapper stage seems to prevent the IR pass from identifying the results pattern correctly.

Here is a program that reproduces the error. It works without .to(). With both reuse_at and .to() applied, it fails in the ReuseBufferGeneration pass with a SegFault.

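import heterocl as hcl

def test_extract_subgraph():
    hcl.init()
    # Assumption: `target` is not defined in the original snippet; aws_f1 is used here only as an example platform
    target = hcl.platform.aws_f1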
    A = hcl.placeholder((10, 32), "A")
    B = hcl.placeholder((10, 32), "B")
    C = hcl.placeholder((10, 32), "C")
    D = hcl.compute(A.shape, lambda y, x: A[y, x] + B[y, x], "D")
    E = hcl.compute(C.shape, lambda y, x: C[y, x] * D[y, x], "E")
    F = hcl.compute((10, 30), lambda y, x: E[y, x] + E[y, x+1] + E[y, x+2], "F")

    s = hcl.create_schedule([A, B, C, D, E, F])
    RB = s.reuse_at(E, s[F], F.axis[1])
    s.partition(RB, hcl.Partition.Block)

    s.to([A, B, C], target.xcel)
    s.to(E, target.host)

Here is the IR before entering the ReuseBufferGeneration IR pass. Note that .to() only adds a "produce test" wrapper to represent the device scope. All the buffers (e.g. E, F) stay untouched, and the reuse node is inserted at the right place.

// attr [test] storage_scope = "global"
allocate test[int32 * 1]
produce test {
  // attr [0] extern_scope = 0
  // attr [(undefined)] device_scope = "fpga"
  produce D {
    // attr [0] extern_scope = 0
    for (y, 0, 10) {
      for (x, 0, 32) {
        D[(x + (y*32))] = int32((int33(A[(x + (y*32))]) + int33(B[(x + (y*32))])))
      }
    }
  }
  produce E {
    // attr [0] extern_scope = 0
    for (y, 0, 10) {
      for (x, 0, 32) {
        E[(x + (y*32))] = int32((int64(C[(x + (y*32))])*int64(D[(x + (y*32))])))
      }
    }
  }
}
produce F {
  // attr [0] extern_scope = 0
  for (y, 0, 10) {
    for (x, 0, 30) {
      reuse E
      // attr [E.reuse] storage_scope = "global"
      allocate E.reuse[int32 * 1]
      array partition variable=E.reuse block factor=0 dim=0
      produce E.reuse {
        // attr [0] extern_scope = 0
        // attr [E.reuse.partitioned] storage_scope = "global"
        allocate E.reuse.partitioned[int32 * 1]
        0
      }
      F[(x + (y*30))] = int32((int34((int33(E[(x + (y*32))]) + int33(E[((x + 1) + (y*32))]))) + int34(E[((x + 2) + (y*32))])))
    }
  }
}

This issue might be related to #230.

Here is the initial attempt at the .to() revamp: https://github.com/Hecmay/heterocl/tree/fix

This issue has been fixed.

The root cause: the IR pass tries to locate the target buffer (i.e. the buffer to be reused) in the program by its pointer address. However, the target buffer's actual pointer address differs from the address on record, so the IR pass cannot find the target buffer, which leads to the SegFault.
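
To make the failure mode concrete, here is a minimal Python sketch. It is not HeteroCL code: Buffer, find_by_identity, and find_by_name are hypothetical names, and Python object identity stands in for the pointer comparison done in the C++ IR pass.

```python
class Buffer:
    """Hypothetical stand-in for an IR buffer variable."""

    def __init__(self, name):
        self.name = name


def find_by_identity(program_buffers, target):
    # Mimics a lookup that compares pointer addresses.
    for buf in program_buffers:
        if buf is target:
            return buf
    return None


def find_by_name(program_buffers, target):
    # Compares a stable property (the name) instead of the address.
    for buf in program_buffers:
        if buf.name == target.name:
            return buf
    return None


# The schedule recorded one handle for E ...
recorded_E = Buffer("E")
# ... but the program holds a different object for the same logical buffer.
program = [Buffer("D"), Buffer("E"), Buffer("F")]

missing = find_by_identity(program, recorded_E)  # nothing matches by identity
found = find_by_name(program, recorded_E)        # the program's own E object
```

In Python the identity lookup simply yields None; a C++ pass that dereferences the result of such a failed lookup without checking it would be one plausible way to end up with a SegFault.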

The fix: the target buffer is still at the same location in the program, but with a different pointer. I added a new pass before the ReuseBufferGeneration pass that forces the two pointer addresses to be the same. This is not a clean solution, but it can be easily removed once we find the right fix.
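
The idea behind such a workaround pass can be sketched in Python as follows. This is only an illustration under assumptions: Var, Node, unify_handles, and contains_handle are hypothetical names, the IR is reduced to a toy tree, and matching handles by name is an assumed strategy rather than what the actual pass necessarily does.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(eq=False)
class Var:
    """Hypothetical buffer handle; identity plays the role of a pointer."""
    name: str


@dataclass(eq=False)
class Node:
    """Hypothetical IR node, e.g. a produce stage owning a buffer."""
    kind: str
    var: Var
    body: List["Node"] = field(default_factory=list)


def unify_handles(node: Node, canonical: Dict[str, Var]) -> None:
    # Rebind every handle whose name matches a canonical handle
    # to that exact object, then recurse into the children.
    if node.var.name in canonical:
        node.var = canonical[node.var.name]
    for child in node.body:
        unify_handles(child, canonical)


def contains_handle(node: Node, target: Var) -> bool:
    # Identity-based search, as an address-comparing pass would do.
    return node.var is target or any(
        contains_handle(child, target) for child in node.body
    )


# Handle recorded by the schedule for the buffer to be reused.
recorded_E = Var("E")

# Toy IR: the wrapper stage holds D and E, but E is a different object.
ir = Node("produce", Var("test"), [
    Node("produce", Var("D")),
    Node("produce", Var("E")),
])

before = contains_handle(ir, recorded_E)   # identity lookup fails
unify_handles(ir, {"E": recorded_E})
after = contains_handle(ir, recorded_E)    # identity lookup now succeeds
```

Running the unification step before the identity-based search lets the search stay unchanged, which matches the stated intent that the workaround can be removed later without touching the reuse pass itself.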