cornell-zhang/heterocl

Incorrect flattened array degrades performance of `reuse_at`


The code in Tutorial 06: Memory Customization generates HLS code that performs poorly, because the line buffer is incorrectly flattened into a one-dimensional array.

void default_function(ap_int<32> A[6*6], ap_int<32> F[3*3], ap_int<32> B[4*4]) {
  #pragma HLS array_partition variable=F complete dim=0
  ap_int<32> _top;
  ap_int<32> LB[18];
  ap_int<32> WB[9];
  for (ap_int<32> y_reuse = 0; y_reuse < 6; ++y_reuse) {
    for (ap_int<32> x_reuse = 0; x_reuse < 6; ++x_reuse) {
    #pragma HLS pipeline
      for (ap_int<32> A_1 = 0; A_1 < 2; ++A_1) {
        LB[(x_reuse + (A_1 * 6))] = LB[((x_reuse + (A_1 * 6)) + 6)];
      }
      LB[(x_reuse + 12)] = A[(x_reuse + (y_reuse * 6))];
      if (2 <= y_reuse) {
        for (ap_int<32> LB_1 = 0; LB_1 < 3; ++LB_1) {
          for (ap_int<32> LB_0 = 0; LB_0 < 2; ++LB_0) {
            WB[(LB_0 + (LB_1 * 3))] = WB[((LB_0 + (LB_1 * 3)) + 1)];
          }
          WB[((LB_1 * 3) + 2)] = LB[(x_reuse + (LB_1 * 6))];
        }
        if (2 <= x_reuse) {
          ap_int<32> sum;
          sum = 0;
          for (ap_int<32> ra8 = 0; ra8 < 3; ++ra8) {
            for (ap_int<32> ra9 = 0; ra9 < 3; ++ra9) {
              sum = ((ap_int<32>)(((ap_int<65>)(((ap_int<64>)WB[(ra9 + (ra8 * 3))]) * ((ap_int<64>)F[(ra9 + (ra8 * 3))]))) + ((ap_int<65>)sum)));
            }
          }
          B[((x_reuse + (y_reuse * 4)) + -10)] = sum;
        }
      }
    }
  }
}

The HLS report is listed below. The pipelined loop achieves an initiation interval (II) of 4 instead of the target of 1.

+ Latency (clock cycles): 
    * Summary: 
    +-----+-----+-----+-----+---------+
    |  Latency  |  Interval | Pipeline|
    | min | max | min | max |   Type  |
    +-----+-----+-----+-----+---------+
    |  150|  150|  150|  150|   none  |
    +-----+-----+-----+-----+---------+

    + Detail: 
        * Instance: 
        N/A

        * Loop: 
        +----------+-----+-----+----------+-----------+-----------+------+----------+
        |          |  Latency  | Iteration|  Initiation Interval  | Trip |          |
        | Loop Name| min | max |  Latency |  achieved |   target  | Count| Pipelined|
        +----------+-----+-----+----------+-----------+-----------+------+----------+
        |- Loop 1  |  148|  148|         9|          4|          1|    36|    yes   |
        +----------+-----+-----+----------+-----------+-----------+------+----------+
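
For comparison, here is a rough sketch of what the same kernel could look like if the reuse buffers kept their two-dimensional shape. This is not the output of HeteroCL. The function names (`conv_2d_buffers`, `conv_reference`, `buffers_match_reference`) are hypothetical, and plain `int` stands in for `ap_int<32>` so the sketch compiles without the Vivado HLS headers. It only shows that the index arithmetic of the flattened code maps onto `LB[3][6]` and `WB[3][3]`. It makes no claim about the II that Vivado HLS would report.

```cpp
// Hypothetical variant of the generated kernel with 2D reuse buffers.
// Assumption: int replaces ap_int<32>; HLS pragmas are omitted.
void conv_2d_buffers(int A[6][6], int F[3][3], int B[4][4]) {
  int LB[3][6] = {};  // line buffer holding three input rows
  int WB[3][3] = {};  // 3x3 sliding window buffer
  for (int y = 0; y < 6; ++y) {
    for (int x = 0; x < 6; ++x) {
      // Shift column x of the line buffer up one row, then load the new pixel.
      // Flattened form: LB[x + r*6] = LB[x + r*6 + 6]; LB[x + 12] = A[x + y*6].
      for (int r = 0; r < 2; ++r) LB[r][x] = LB[r + 1][x];
      LB[2][x] = A[y][x];
      if (y >= 2) {
        // Shift each window row left by one and append column x of LB.
        for (int r = 0; r < 3; ++r) {
          for (int c = 0; c < 2; ++c) WB[r][c] = WB[r][c + 1];
          WB[r][2] = LB[r][x];
        }
        if (x >= 2) {
          int sum = 0;
          for (int r = 0; r < 3; ++r)
            for (int c = 0; c < 3; ++c) sum += WB[r][c] * F[r][c];
          // Flattened form: B[x + y*4 - 10], i.e. row y-2, column x-2.
          B[y - 2][x - 2] = sum;
        }
      }
    }
  }
}

// Direct 3x3 convolution without reuse buffers, used only as a software check.
void conv_reference(int A[6][6], int F[3][3], int B[4][4]) {
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j) {
      int sum = 0;
      for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c) sum += A[i + r][j + c] * F[r][c];
      B[i][j] = sum;
    }
}

// Returns true when the buffered kernel and the reference agree on a test input.
bool buffers_match_reference() {
  int A[6][6], F[3][3], B[4][4] = {}, R[4][4] = {};
  for (int i = 0; i < 6; ++i)
    for (int j = 0; j < 6; ++j) A[i][j] = i * 6 + j + 1;
  for (int r = 0; r < 3; ++r)
    for (int c = 0; c < 3; ++c) F[r][c] = r - c + 1;  // arbitrary filter values
  conv_2d_buffers(A, F, B);
  conv_reference(A, F, R);
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j)
      if (B[i][j] != R[i][j]) return false;
  return true;
}
```

Calling `buffers_match_reference()` from a small test harness is enough to confirm that the 2D indexing is functionally equivalent before trying it in the HLS flow.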

Moreover, HeteroCL does not yet automatically insert optimization primitives (e.g. `#pragma HLS pipeline`) for the `reuse_at` part.

Just as a reminder: there is no need to add partition primitives for the reuse buffers.
https://github.com/cornell-zhang/heterocl/blob/master/tutorials/tutorial_06_memory.py#L238-L239

Once the outer loop is pipelined, LB and WB are partitioned automatically, as the Vivado HLS output below shows.

INFO: [XFORM 203-102] Partitioning array 'LB.V' (vhls_code.cpp:10) in dimension 1 automatically.
INFO: [XFORM 203-102] Partitioning array 'WB.V' (vhls_code.cpp:12) in dimension 1 automatically.
INFO: [XFORM 203-102] Partitioning array 'WB.V.0' (vhls_code.cpp:12) automatically.
INFO: [XFORM 203-102] Partitioning array 'WB.V.1' (vhls_code.cpp:12) automatically.
INFO: [XFORM 203-102] Partitioning array 'WB.V.2' (vhls_code.cpp:12) automatically.
INFO: [XFORM 203-101] Partitioning array 'F.V' (vhls_code.cpp:7) in dimension 1 completely.
INFO: [XFORM 203-101] Partitioning array 'F.V' (vhls_code.cpp:7) in dimension 2 completely.