cornell-zhang/heterocl

Inter-stage streaming cannot pass synthesis

Closed this issue · 4 comments

This program uses .to to perform inter-stage data streaming.

import heterocl as hcl
import numpy as np

def test():
    A = hcl.placeholder((10,), "A")

    def kernel(A):
        B = hcl.compute(A.shape, lambda i: A[i] + 1, "B")
        C = hcl.compute(B.shape, lambda i: A[i] + B[i], "C")
        return C

    target = hcl.platform.zc706
    target.config(compile="vivado_hls", mode="csyn")
    s = hcl.create_schedule([A], kernel)
    s.to([A], target.xcel)
    s.to(kernel.C, target.host)
    s.to(kernel.B, s[kernel.C]) # inter-stage streaming
    f = hcl.build(s, target)
    np_A = np.zeros((10,))
    np_C = np.zeros((10,))
    hcl_A = hcl.asarray(np_A)
    hcl_C = hcl.asarray(np_C)
    f(hcl_A, hcl_C)

The generated code is shown below. It does not specify a dataflow region or the FIFO size, which causes a synthesis error.

  void test(hls::stream<bit32 >& A_channel, hls::stream<bit32 >& C_channel) {
  #pragma HLS INTERFACE axis port=A_channel offset=slave bundle=gmem0
  #pragma HLS INTERFACE axis port=C_channel offset=slave bundle=gmem1
  #pragma HLS INTERFACE s_axilite port=return bundle=control
    bit32 C[10];
    bit32 A[10];
    for (bit32 A0 = 0; A0 < 10; ++A0) {
      A[A0] = A_channel.read();
    }
    bit32 B[10];
    hls::stream<bit32 > B_pipe1;
    for (bit32 i = 0; i < 10; ++i) {
      B_pipe1.write((A[i] + 1));
    }
    for (bit32 i1 = 0; i1 < 10; ++i1) {
      C[i1] = ((bit32)(((ap_int<33>)A[i1]) + ((ap_int<33>)B_pipe1.read())));
    }
    for (bit32 C0 = 0; C0 < 10; ++C0) {
      C_channel.write(C[C0]);
    }
  }

ERROR: [XFORM 203-733] An internal stream 'B_pipe1.V.V' (kernel.cpp:22) with default size is used in a non-dataflow region, which may result in deadlock. Please consider to resize the stream using the directive 'set_directive_stream' or the 'HLS stream' pragma.
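
To illustrate why the tool complains, here is a minimal plain-Python sketch of the two loops around B_pipe1. It is not HeteroCL or HLS code: BoundedFifo and run_sequential are hypothetical names made up for this illustration, and the depth of 2 is only an assumed stand-in for a small default stream size. Because the loops run one after another (no dataflow region), the producer finishes all 10 writes before the consumer reads anything, so the FIFO has to hold every element.

```python
from collections import deque


class BoundedFifo:
    """Toy FIFO with a fixed capacity, standing in for an hls::stream."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def can_write(self):
        return len(self.items) < self.depth

    def write(self, value):
        self.items.append(value)

    def read(self):
        return self.items.popleft()


def run_sequential(a, depth):
    """Run the B and C loops back to back, as in the generated code.

    Returns C on success, or None if the producer loop would block
    forever on a full FIFO (the deadlock the error message warns about).
    """
    b_pipe = BoundedFifo(depth)
    # Producer loop: B[i] = A[i] + 1, written to the stream.
    for x in a:
        if not b_pipe.can_write():
            # No consumer runs until this loop ends, so a full FIFO
            # can never drain and the design is stuck.
            return None
        b_pipe.write(x + 1)
    # Consumer loop: C[i] = A[i] + B[i], reading B from the stream.
    return [x + b_pipe.read() for x in a]


a = list(range(10))
print(run_sequential(a, depth=2))   # None: the third write finds the FIFO full
print(run_sequential(a, depth=10))  # C[i] = 2 * i + 1 for i in 0..9
```

Under this model, the sequential version only completes when the depth is at least the number of streamed elements (10 here). A dataflow region would instead let the consumer drain the FIFO while the producer is still writing.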

This underscores the importance of automatically inferring FIFO sizes.
Can we manually specify the FIFO size in HeteroCL to work around this issue?

Yes, I am currently trying to tackle this problem, and this program is a test example. I discussed the implementation of .to with @hecmay yesterday. It seems there are still several issues that need to be fixed.

Two more issues here:

  1. s.to(A, s[kernel.B]) may cause a segmentation fault.
  2. Consecutive data streaming does not work correctly. In the following code, the stream buffer for stage C is not generated properly:
       s.to(kernel.B, s[kernel.C])
       s.to(kernel.C, s[kernel.D])
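
For reference, the behaviour one would expect from the two consecutive .to calls can be sketched in plain Python as one FIFO per streamed edge. This is only a model, not HeteroCL output: run_chain is a hypothetical helper, and since stage D is not shown above, D[i] = C[i] + 1 is an assumed definition chosen purely for illustration.

```python
from collections import deque


def run_chain(a):
    """Reference model of B -> C -> D with one FIFO per streamed edge."""
    n = len(a)
    b_to_c = deque()  # models the stream from s.to(kernel.B, s[kernel.C])
    c_to_d = deque()  # models the stream from s.to(kernel.C, s[kernel.D])

    # Stage B: B[i] = A[i] + 1, pushed into the first stream.
    for i in range(n):
        b_to_c.append(a[i] + 1)

    # Stage C: reads B from the first stream and writes its own result
    # into the second stream instead of a local array.
    for i in range(n):
        c_to_d.append(a[i] + b_to_c.popleft())

    # Stage D (assumed definition): D[i] = C[i] + 1, read from the stream.
    d = [c_to_d.popleft() + 1 for _ in range(n)]

    # Both streams should be fully drained once every stage has run.
    assert not b_to_c and not c_to_d
    return d


print(run_chain([0, 1, 2]))  # [2, 4, 6]
```

The point of the model is that stage C is both a consumer (of B) and a producer (for D), so the generated code needs a separate stream buffer for each of the two edges.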