cornell-zhang/heterocl

Support explicit unroll at certain loop axis

Closed this issue · 5 comments

Aside from unrolling a loop implicitly (i.e., by adding #pragma unroll and letting the EDA tools unroll the loop), we also want to unroll a loop into multiple PEs explicitly. This allows users to generate multiple PEs for a single stage and connect them in different ways to build custom dataflow accelerators.

An example of 1D convolution kernel:

def kernel(W, X):
    # k is the reduction axis over the K filter taps
    k = hcl.reduce_axis(0, K)
    return hcl.compute((size,), lambda x: hcl.sum(X[x+k]*W[k], axis=k), "Y")

# unroll the inner loop into PEs
pes = s[kernel].unroll(axis=1)
pe0, pe1, pe2 = pes

Each PE returned by the unroll() primitive corresponds to a separate (non-inlined) kernel function call. The HCL compiler should create a separate kernel definition and function call for each PE.

For the 1D convolution example above, assume the loop trip count is 3. In this case, we will generate three separate functions (i.e., pe0, pe1, pe2) and call them in a dataflow region so that they can run in parallel:

void pe0() {
    //...
}

void pe1() {
    //...
}

void pe2() {
    //...
}

void top() {
    #pragma dataflow
    pe0();
    pe1();
    pe2();
}
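
To make the intended semantics concrete, here is a minimal plain-Python model of the proposal, assuming a trip count of 3. It does not call HeteroCL; make_pe, conv1d_reference, and top are hypothetical names. The issue leaves open how the PEs are connected, so this sketch simply assumes each PE handles one filter tap and the partial results are summed.

```python
# Plain-Python model of explicit unrolling; no HeteroCL API is used here.
# make_pe, conv1d_reference, and top are hypothetical names for illustration.

def conv1d_reference(W, X):
    """Rolled version: the reduction over k stays an ordinary inner loop."""
    size = len(X) - len(W) + 1
    return [sum(X[x + k] * W[k] for k in range(len(W))) for x in range(size)]

def make_pe(k):
    """Build the PE for reduction index k: it handles exactly one filter tap."""
    def pe(W, X, size):
        return [X[x + k] * W[k] for x in range(size)]
    pe.__name__ = f"pe{k}"
    return pe

# Unrolling a trip count of 3 yields three separate PE functions.
pe0, pe1, pe2 = (make_pe(k) for k in range(3))

def top(W, X):
    """Stand-in for the dataflow region: call each PE, then combine (assumed: sum)."""
    size = len(X) - len(W) + 1
    partials = [pe(W, X, size) for pe in (pe0, pe1, pe2)]
    return [sum(column) for column in zip(*partials)]
```

The three PEs share no state, which is what would let them run concurrently in a dataflow region; only the final combination step depends on all of them.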

Looks good. This is pretty much what we agreed on.
To distinguish this from the current unrolling support, maybe we should use another primitive, say parallel(), to indicate the explicit duplication?

Another (perhaps cleaner) solution is to look at the left-hand side of the statement when we call this primitive. If we return a list of named objects, we explicitly duplicate the loop body.
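
One possible way to approximate the left-hand-side idea in Python, sketched under assumptions: return an object that only switches to explicit duplication when the caller unpacks it. UnrollResult is a hypothetical class, not part of HeteroCL, and the PE names it produces are made up for illustration.

```python
# Hypothetical sketch: UnrollResult is not a HeteroCL class.

class UnrollResult:
    """Returned by a mock unroll(); duplicates only if the caller unpacks it."""

    def __init__(self, stage_name, trip_count):
        self.stage_name = stage_name
        self.trip_count = trip_count
        self.explicit = False  # stays False for a plain (implicit) unroll

    def __iter__(self):
        # Tuple unpacking on the left-hand side calls __iter__, so this is
        # where the mock switches to explicit duplication.
        self.explicit = True
        return iter([f"{self.stage_name}_pe{i}" for i in range(self.trip_count)])

implicit = UnrollResult("Y", 3)   # result kept as a single object
explicit = UnrollResult("Y", 3)
pe0, pe1, pe2 = explicit          # unpacking requests three PEs
```

A caveat with this approach: the choice is made after the primitive has already returned, so the actual IR transformation would have to be deferred until the result is used.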

I will try to add a parallel() primitive first to avoid messing up anything in the original unroll() primitive. We can switch to the second solution later.

Since we need to create some new stages in the schedule, we may need to do something like s.parallel(stage, axis=1) (i.e., an IR transformation at the schedule level) instead of s[stage].parallel(axis=1) (i.e., an IR transformation inside the stage).
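
A toy model of why the schedule-level form is the natural fit: replacing one stage with several requires write access to the schedule's stage list, which a single stage does not have. Stage and Schedule here are hypothetical stand-ins, not the HeteroCL classes, and the extents field is an assumption used to look up the trip count of the chosen axis.

```python
# Toy model only: Stage and Schedule are hypothetical, not HeteroCL classes.

class Stage:
    def __init__(self, name, extents=()):
        self.name = name
        self.extents = list(extents)  # assumed: trip count of each loop axis

class Schedule:
    def __init__(self, stages):
        self.stages = list(stages)

    def parallel(self, stage, axis):
        """Replace `stage` with one PE stage per iteration of loop `axis`."""
        index = self.stages.index(stage)
        remaining = stage.extents[:axis] + stage.extents[axis + 1:]
        pes = [Stage(f"{stage.name}_pe{i}", remaining)
               for i in range(stage.extents[axis])]
        # Splicing in the new stages mutates the schedule, not the stage.
        self.stages[index:index + 1] = pes
        return pes

y = Stage("Y", extents=[8, 3])
s = Schedule([Stage("load"), y, Stage("store")])
pes = s.parallel(y, axis=1)
```

After the call, the schedule holds load, Y_pe0, Y_pe1, Y_pe2, store in that order, and the original Y stage is gone.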