cornell-zhang/hcl-dialect

Multiple if statements testing same scalar generates fault

Closed this issue · 14 comments

def test_recexecute():
    hcl.init()
    rshape = (1,)

    def kernel():
        inst_id = hcl.scalar(0, "inst_id", dtype=hcl.UInt(16)).v
        with hcl.if_(inst_id == 0):
            hcl.print((), "match\n")
        with hcl.if_(inst_id == 1):
            hcl.print((), "match\n")
        r = hcl.compute(rshape, lambda _:0, dtype=hcl.UInt(32))
        return r
    #
    s = hcl.create_schedule([], kernel)
    #print(hcl.lower(s))
    hcl_res = hcl.asarray(np.zeros(rshape, dtype=np.uint32), dtype=hcl.UInt(32))
    f = hcl.build(s)

generates the following error:

error: 'affine.load' op operation destroyed but still has uses
LLVM ERROR: operation destroyed but still has uses
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
hcl-dialect-prototype/build/tools/hcl/python_packages/hcl_core/hcl_mlir/_mlir_libs/libHCLMLIRAggregateCAPI.so.15(+0x11be8e4)[0x14d18d68e8e4]
...

A workaround is to NOT lookup the .v value on assignment and instead use the .v where inst_id is used. Intuitively, this change shouldn't matter though. Is there an assumption that each .v result is only used once? If so, the coding style used above where the the .v is dereferenced once but used multiple times will not work.

Same issue as #71

This issue is turning out to be what's causing the most issues now (and workarounds can be quite invasive). If all references to the .v are within one function, it is easy to fix. The problem is that if it is passed to some other function, then failures would depend on how that function uses it, etc. We're suddenly in a situation where changing some low-level function (e.g., to add a bounds check) will cause a fault because it created additional references to the expression.

The same issue also seems to apply to struct fields.

Thanks! We are discussing about a proper fix for this and many related issues. Hongzheng explained this in one of our conversations, let me paste it here for reference:

I notice most of the bugs of the control flow constructs are actually caused by Python’s language features. For example, #71, #135, #138, #141, #148 are all hcl.if_/hcl.while_ problems. The main complexity comes from the Python evaluation order. Consider hcl.if_(x == 0), the condition x==0 part is evaluated (the IR is already built) before calling the hcl.if_ function. In the current workflow, we need to do the following things:

  1. Traverse back from x==0 (CmpOp) and check whether this expression is affine. If the expression is non-affine, then it is fine. We can simply generate the scf.if statement and plug the result of CmpOp into it.
  2. If the expression is affine, we need to traverse back again to build an affine set expression and remove all the previous operations involving this condition, otherwise the condition will be computed twice. We require this separation between affine and non-affine expression to leverage the affine.if facility.

When traversing back, we may remove some of the useful operations by mistake, since we do not know the scope of the variable before executing the following code. It is not like traditional parser that can read the code from left to right and generate the corresponding IR. Currently we are highly restricted by the host language, so it is a bit counterintuitive when building these control flows with conditions.

Hi @jcasas00, could you try latest commits and see if it works?

It seems to work with the .v usage now.

But it still fails if it is a struct field. For example, if the inst_id in the example above was defined as:

stype = hcl.Struct({"x": hcl.UInt(8), "y": hcl.UInt(8)})
inst_id = hcl.scalar(0x1234, "foo", dtype=stype).v.x

The error in this case is "error: operand #0 does not dominate this use". But the usage pattern that causes the issue is the same (i.e., multiple hcl.if_(i ...) statements).

Ah yes, the scope issue. When a LoadOp is moved to another block I also rebuild one in the original place. I haven't done that for StructGetOp. I'll add this for StructGetOp, GetBitOp, GetSliceOp etc.

The same error (#0 does not dominate this use) in the following scenario (note the "+ 1"):

inst_id = hcl.scalar(0, "inst_id", dtype=hcl.UInt(16)).v + 1

Hi Jeremy, could you pull both repositories and see if it works now? I've added test cases on the CI:
https://github.com/cornell-zhang/heterocl/blob/hcl-mlir/tests/mlir/test_scalar.py

The test cases cover

  • Load once before the if statement and use it in multiple conditions
  • Load in the condition
  • Struct access in the condition expression
  • Struct access before the if statements and use fields in the conditions.
  • GetBit, GetSlice, add/mul in the condition expression

This version seems to work now! Thanks for the quick fixes/updates!!

I'm seeing another failure with the same cause signature (i.e., an expression used multiple times).

def kernel():
      f = hcl.scalar(0, "f", dtype='uint32')
      t = hcl.scalar(0, "t", dtype='uint32')
      c = f.v == 0
      with hcl.if_(c == 0):
          hcl.print((),"c is false")
          with hcl.while_(c == 0):    # ... hang
              t.v = 0 # some irrelevant expr
      r = hcl.compute(rshape, lambda _:0, dtype=hcl.Int(32))
      return r

In this case, because the hcl.while_ statement is using 'c', the following error is generated:

python: /home/jcasas/dprive/llvm-project/mlir/lib/IR/Block.cpp:231: mlir::Operation* mlir::Block::getTerminator(): Assertion '!empty() && back().mightHaveTrait<OpTrait::IsTerminator>()' failed.

The intent of this code was an assert w/a for now (i.e., print the failure and hang). I could use a different while test (e.g., while((f.v != 0) and get it to work but the system should still not fault with the code above.

This seems like a different issue with the same cause. hcl.while_ used ASTVisitor in "remove" mode to remove and rebuild the condition operations in scf.while's condition block. Consequently, it removed the loadOp built from f.v, causing an assertion fault. I have fixed this issue by replacing the AST remover with an AST mover that doesn't remove any op but instead moves them to the right block.
Commit: 592b298

  • Test cases are to be added for this issue.

Removing urgent tag as the primary issue has been resolved.

The scope issue has been resolved by revamping the frontend and adding HCL AST: cornell-zhang/heterocl#473