cornell-zhang/heterocl

How to split one stage into two with hcl.split()

antonysigma opened this issue

Hi HeteroCL developers,

I came across a similar tutorial in the Halide-HLS project, in which they customize the 2D convolution algorithm by (1) splitting the large image into tiles on the host (Zynq ARM64), and then (2) sending the tiles to the accelerator (Zynq FPGA) to run the convolution steps. The processed tiles are then sent back to the host for tile stitching.

Reference:
https://github.com/jingpu/Halide-HLS/blob/905d2f2ad560246673ba3a84b8a6d8be308e481f/apps/hls_examples/gaussian_hls/pipeline.cpp#L103-L107
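
For concreteness, here is a plain-NumPy sketch of that tile-and-stitch flow. The tile size, the halo handling, and the box_sum stand-in for the convolution are my own illustrations, not taken from the Halide-HLS tutorial:

import numpy as np

def box_sum(tile):
    # 2x2 box sum; stands in for the accelerated convolution step.
    return tile[:-1, :-1] + tile[1:, :-1] + tile[:-1, 1:] + tile[1:, 1:]

def tiled_pipeline(A, th=4, tw=4):
    H, W = A.shape
    out = np.zeros((H - 1, W - 1), dtype=A.dtype)
    for y in range(0, H - 1, th):
        for x in range(0, W - 1, tw):
            # Host: cut out a tile with a one-pixel halo so the stencil
            # has the neighbours it needs at the tile border.
            tile = A[y:min(y + th + 1, H), x:min(x + tw + 1, W)]
            # "Accelerator": process the tile, then stitch it back on the host.
            out[y:y + th, x:x + tw] = box_sum(tile)
    return out

# tiled_pipeline(A) matches box_sum(A) applied to the whole image at once.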

I wonder how we can describe such a customization with the HeteroCL scheduling syntax, without rewriting the algorithm.

In other words, how do I "split" stage B into the sub-stages tile_producer and tile_consumer, as in the following pseudo-code? Or should I describe the sub-stages explicitly in order to use the hcl.to() syntax?

import heterocl as hcl

hcl.init()
A = hcl.placeholder((10, 10), "A")  # illustrative input size

def one_stage(A):
    # 2x2 box sum (border handling at the image edge omitted for brevity)
    B = hcl.compute(A.shape, lambda x, y: A[x, y] + A[x + 1, y]
                                        + A[x, y + 1] + A[x + 1, y + 1], "B")
    return B

s = hcl.create_schedule([A], one_stage)

# Split the image into tiles of size 2x2
# (split_to_tiles is a made-up primitive -- this is pseudo-code)
s_B = one_stage.B
x_out, y_out, x_in, y_in = s[s_B].split_to_tiles(s_B.axis[0], s_B.axis[1], 2, 2)

# Define a mock-up target
target = hcl.Platform.zcu102
target.config(compiler="vitis", backend="vhls")

# Implement intermediate stage "tile producer" on the host CPU,
s.to(A, target.host)
s.to(s_B.axis[x_out, y_out], target.host)

# then push the tiles into the accelerator
s.to(s_B.axis[x_in, y_in], target.xcel)

# Move the "tile consumer" output back to host CPU for tile stitching
s.to(s_B, target.host) # Not sure how to implement it with HeteroCL

Hi @antonysigma. Thanks for your interest.

In a future release, we will support data movement under a specific loop axis, which you can combine with loop tiling/reordering to realize the computation you described. To be more concrete, please see the following code example:

import heterocl as hcl

hcl.init()
A = hcl.placeholder((10, 10), "A")  # illustrative input size

def one_stage(A):
    B = hcl.compute(A.shape, lambda x, y: A[x, y] + A[x + 1, y]
                                        + A[x, y + 1] + A[x + 1, y + 1], "B")
    return B

s = hcl.create_schedule([A], one_stage)

# Define a mock-up target
target = hcl.Platform.zcu102
target.config(compiler="vitis", backend="vhls")

# Split the image into tiles of size 2x2
s_B = one_stage.B
yo, yi, xo, xi = s[s_B].tile(axis=[0, 1], factor=[2, 2])
s[s_B].reorder(yo, xo, yi, xi)

# Move the input from the host to the FPGA accelerator and store it
# (one tile at a time) under loop axis yi in a local on-chip buffer
s.to(A, target.xcel).to(s_B, axis=yi)

# Move the output from the FPGA to the host once the convolution on the input tile is done
s.to(s_B, target.host, axis=yi)

In other words, the sub-stages for producing and consuming image tiles will be inferred automatically by the HCL compiler based on the information provided by the .to() primitive. Right now the master branch of HCL only provides preliminary support for .to(), which moves an entire tensor between the host and the accelerator, but we will release a new version of HCL very soon to support this feature. Stay tuned!
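
In the meantime, if you do want to describe the sub-stages explicitly, a rough sketch with today's whole-tensor .to() could look like the code below. The staged buffer, the placeholder shape, and the stage names are illustrative, and zcu102 is the same mock-up target as in your snippet:

import heterocl as hcl

hcl.init()
A = hcl.placeholder((10, 10), "A")  # illustrative input size

def two_stage(A):
    # Explicit "tile producer": stage the input into an intermediate buffer.
    staged = hcl.compute(A.shape, lambda x, y: A[x, y], "staged")
    # Explicit "tile consumer": the 2x2 box sum over the staged data.
    B = hcl.compute(A.shape, lambda x, y: staged[x, y] + staged[x + 1, y]
                                        + staged[x, y + 1] + staged[x + 1, y + 1], "B")
    return B

s = hcl.create_schedule([A], two_stage)

target = hcl.Platform.zcu102
target.config(compiler="vitis", backend="vhls")

# Whole-tensor data movement, as supported on master today: send the
# staged input to the FPGA and bring the full result back to the host.
s.to(two_stage.staged, target.xcel)
s.to(two_stage.B, target.host)

Note that this still moves whole tensors at once; the per-tile streaming is exactly what the upcoming axis argument to .to() adds.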

Thank you @hecmay for the prompt reply! I certainly look forward to the data-movement customization by loop axis.

It is also very helpful to see example code at this stage. When the new feature lands on GitHub, I will be curious how the order of the following calls influences the data-transfer mechanism.

s.to(s_B, axis=yi).to(s_B, target.host, axis=yi)

s.to(s_B, target.host, axis=yi).to(s_B, axis=yi)