cornell-zhang/heterocl

Cannot apply parallel primitive in HeteroCL module

Opened this issue · 5 comments

The issue occurs in the digit recognition example with the .parallel() primitive. I was trying to use a kernel function to update knn_mat instead of calling hcl.compute, and then schedule the itervars inside that kernel function (i.e., the hcl module). The program after modification looks like:

    def knn(*placeholders):

        @hcl.def_([(10,1800), (10,3)])
        def update_knn(dist, knn_mat):
            with hcl.for_(0,10, name="i") as i:
                with hcl.for_(0,1800, name="j") as j:
                    max_id = hcl.scalar(0, "max_id")
                    with hcl.for_(0, 3, name="k") as k:
                        with hcl.if_(knn_mat[i][k] > knn_mat[i][max_id.v]):
                            max_id.v = k
                    with hcl.if_(dist[i][j] < knn_mat[i][max_id.v]):
                        knn_mat[i][max_id.v] = dist[i][j]

        update_knn(dist, knn_mat)
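
For readers who want to check what the loop nest computes, here is a plain-Python sketch of the same logic without HeteroCL. The function name `update_knn_reference` and the list-of-lists inputs are assumptions made for illustration; they are not part of the HeteroCL example.

```python
def update_knn_reference(dist, knn_mat):
    """Plain-Python mirror of the update_knn loop nest (illustrative only)."""
    for i in range(len(dist)):
        for j in range(len(dist[i])):
            # Find the slot that currently holds the largest candidate.
            max_id = 0
            for k in range(len(knn_mat[i])):
                if knn_mat[i][k] > knn_mat[i][max_id]:
                    max_id = k
            # Replace that slot when the new distance is smaller.
            if dist[i][j] < knn_mat[i][max_id]:
                knn_mat[i][max_id] = dist[i][j]
    return knn_mat


# Small made-up input: one row of distances, three candidate slots.
result = update_knn_reference([[5, 1, 4, 2, 3]], [[50, 50, 50]])
```

Each row of knn_mat ends up holding the smallest distances seen in the matching row of dist. Note that row i of knn_mat is carried from one j iteration to the next, while different rows never touch each other.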

And the scheduling is performed as the following snippet:

    knn_update = knn.update_knn
    s[knn_update].reorder(knn_update.axis[0], knn_update.axis[1])
    # ISSUE: this primitive will lead to segmentation fault
    # s[knn_update].parallel(knn_update.axis[1])
    s[knn_update].pipeline(knn_update.axis[0])

All other scheduling primitives work well, but when I call .parallel(), the program crashes with a segmentation fault.

Do we translate the parallel() primitive to a corresponding pragma in HLS?

Currently the parallel primitive is only supported on CPU, where it triggers multi-threaded execution.

As I mentioned before, we need to support it for hardware synthesis. Shall we open another issue? If not, this will fall through the cracks again.

It's ignored in the HLS code generator. I am considering letting CodeGenC translate .parallel() to OpenMP pragmas.
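
To make the OpenMP idea concrete, here is a toy Python emitter that prints a C loop nest and prefixes a loop with `#pragma omp parallel for` when it is marked parallel. This is only a sketch: `emit_loop` and the `update(...)` call in the loop body are hypothetical names, and nothing here reflects how CodeGenC is actually implemented. The outer i loop is the one marked parallel in this sketch.

```python
def emit_loop(var, extent, body, parallel=False):
    """Return C source lines for a counted loop (toy emitter, illustrative only)."""
    lines = []
    if parallel:
        # OpenMP work-sharing directive applied to the loop that follows.
        lines.append("#pragma omp parallel for")
    lines.append(f"for (int {var} = 0; {var} < {extent}; ++{var}) {{")
    # Indent the nested body by two spaces.
    lines.extend("  " + line for line in body)
    lines.append("}")
    return lines


# Hypothetical loop body; update() stands in for the real kernel statements.
inner = emit_loop("j", 1800, ["update(dist, knn_mat, i, j);"])
outer = emit_loop("i", 10, inner, parallel=True)
c_source = "\n".join(outer)
print(c_source)
```

A real code generator would also have to decide which variables are shared or private, which this sketch leaves out.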

And for HLS codegen, could we use .parallel() to perform kernel replication and exploit data-level parallelism?
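
As a rough software analogy for kernel replication, the sketch below splits the rows across several copies of the same kernel and runs them concurrently with Python threads. The names `kernel` and `replicated`, the `replicas` parameter, and the row-wise split are all assumptions for illustration; real HLS replication would instantiate hardware copies rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def kernel(dist_rows, knn_rows):
    """One replica: run the top-k update on its own slice of rows."""
    for d_row, k_row in zip(dist_rows, knn_rows):
        for d in d_row:
            # Index of the largest candidate (first one wins on ties).
            max_id = max(range(len(k_row)), key=k_row.__getitem__)
            if d < k_row[max_id]:
                k_row[max_id] = d
    return knn_rows


def replicated(dist, knn_mat, replicas=2):
    """Split rows into contiguous chunks, one chunk per kernel replica."""
    n = len(dist)
    step = (n + replicas - 1) // replicas  # ceiling division
    chunks = [(dist[s:s + step], knn_mat[s:s + step]) for s in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        parts = list(pool.map(lambda c: kernel(*c), chunks))
    # Stitch the per-replica results back together in row order.
    return [row for part in parts for row in part]


# Small made-up input: three rows, three candidate slots per row.
dist = [[5, 1, 4, 2, 3], [9, 8, 7, 6, 5], [3, 3, 3, 3, 3]]
knn = [[50, 50, 50] for _ in range(3)]
out = replicated(dist, knn, replicas=2)
```

Splitting by row works here because each replica only reads and writes its own rows, so the replicas need no synchronization.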

@hecmay yes, we can at least use it for the OpenCL flow. I believe the Merlin compiler supports parallel execution as well.