cornell-zhang/hcl-dialect

[Op] Support popcount in hardware

Closed this issue · 4 comments

Population count (popcount) is a very important hardware intrinsic, but it is still not supported by HeteroCL. Currently we need to write popcount explicitly as a reduction over every bit. An example is shown below.

import heterocl as hcl

def test_popcount():
    hcl.init(hcl.Int(32))
    A = hcl.placeholder((3,))
    B = hcl.placeholder((3,))
    def conv(A, B):
        rb = hcl.reduce_axis(0, 32)
        B = hcl.compute((3,), lambda i:
            hcl.sum((A[i] + B[i])[rb], axis=[rb], dtype=hcl.Int(32)), name="B", dtype=hcl.Int(32))
        return B
    s = hcl.create_schedule([A, B], conv)
    print(hcl.lower(s))

It generates the following MLIR code.

module {
  func @top(%arg0: memref<3xi32>, %arg1: memref<3xi32>) -> memref<3xi32> attributes {bit, extra_itypes = "ss", extra_otypes = "s"} {
    %0 = memref.alloc() {name = "B"} : memref<3xi32>
    affine.for %arg2 = 0 to 3 {
      %1 = memref.alloc() {name = "sum_rv"} : memref<1xi32>
      %c0 = arith.constant 0 : index
      %c0_i32 = arith.constant 0 : i32
      affine.store %c0_i32, %1[%c0] {to = "sum_rv"} : memref<1xi32>
      affine.for %arg3 = 0 to 32 {
        %3 = affine.load %arg0[%arg2] {from = "compute_0"} : memref<3xi32>
        %4 = affine.load %arg1[%arg2] {from = "compute_1"} : memref<3xi32>
        %5 = arith.addi %3, %4 : i32
        %6 = hcl.get_bit(%5 : i32, %arg3) -> i1
        %7 = affine.load %1[%c0] {from = "sum_rv"} : memref<1xi32>
        %8 = arith.extsi %6 : i1 to i32
        %9 = arith.addi %8, %7 : i32
        affine.store %9, %1[%c0] {to = "sum_rv"} : memref<1xi32>
      } {loop_name = "rx_0", reduction}
      %c0_0 = arith.constant 0 : index
      %2 = affine.load %1[%c0_0] {from = "sum_rv"} : memref<1xi32>
      affine.store %2, %0[%arg2] {to = "B"} : memref<3xi32>
    } {loop_name = "i", stage_name = "B"}
    return %0 : memref<3xi32>
  }
}

But this incurs many redundant memory loads and much redundant computation in the innermost reduction loop. Actually, A[i]+B[i] can be hoisted out of the innermost loop, and popcount can be implemented as a reduction tree. Otherwise, there is a loop-carried dependence on the reduction variable, which causes low performance on FPGA.
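A minimal sketch of the intended structure (a hypothetical `popcount_tree` helper, not code from HeteroCL): the addition is computed once outside the bit loop, and the per-bit sum is organized as a balanced reduction tree, so there is no sequential chain on the reduction variable.

```cpp
#include <ap_int.h>

// Balanced-tree popcount: log2(32) = 5 levels of pairwise adds,
// with no loop-carried dependence on a single accumulator.
ap_int<32> popcount_tree(ap_uint<32> x) {
#pragma HLS inline
  ap_uint<6> partial[32];  // 6 bits are enough to hold counts up to 32
#pragma HLS array_partition variable=partial complete
  for (int i = 0; i < 32; i++) {  // level 0: one partial count per bit
#pragma HLS unroll
    partial[i] = x[i];
  }
  for (int step = 1; step < 32; step *= 2) {  // levels 1..5: pairwise sums
#pragma HLS unroll
    for (int i = 0; i < 32; i += 2 * step) {
#pragma HLS unroll
      partial[i] += partial[i + step];
    }
  }
  return partial[0];
}

// Hypothetical usage inside the loop over i, with the add hoisted:
//   out[i] = popcount_tree(A[i] + B[i]);
```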

After doing some experiments with Vivado HLS, I am not sure whether a dedicated popcount implementation is needed. In the following binary convolution examples, the latency is similar, but the resource usage differs a lot.

void top(
  ap_uint<6> v0[4][8][8][1],
  ap_uint<6> v1[16][3][3][1],
  ap_int<32> v2[4][6][6][16]
) {	//
  l_B_n: for (int n = 0; n < 4; n++) {	//
    l_h: for (int h = 0; h < 6; h++) {	//
      l_w: for (int w = 0; w < 6; w++) {	//
        l_c: for (int c = 0; c < 16; c++) {	//
        #pragma HLS pipeline II=1
          ap_int<32> sum_rv;	//
          sum_rv = 0;	//
          l_rx_1: for (int rx_1 = 0; rx_1 < 3; rx_1++) {	//
            l_rx_2: for (int rx_2 = 0; rx_2 < 3; rx_2++) {	//
              l_rx_0: for (int rx_0 = 0; rx_0 < 1; rx_0++) {	//
                l_rx_3: for (int rx_3 = 0; rx_3 < 6; rx_3++) {	//
                  ap_uint<6> v12 = v0[n][(h + rx_1)][(w + rx_2)][rx_0];	//
                  ap_uint<6> v13 = v1[c][rx_1][rx_2][rx_0];	//
                  ap_uint<6> v14 = v12 ^ v13;	//
                  ap_uint<1> v15 = v14[rx_3];	//
                  ap_int<32> v16 = sum_rv;	//
                  ap_int<32> v17 = v15;	//
                  ap_int<32> v18 = v17 + v16;	//
                  sum_rv = v18;	//
                }
              }
            }
          }
          ap_int<32> v19 = sum_rv;	//
          ap_int<32> v20 = v19 << 1;	//
          ap_int<32> v21 = 54 - v20;	//
          v2[n][h][w][c] = v21;	//
        }
      }
    }
  }
}

This version treats popcount as the innermost reduction loop and gets the following results (II=5).

```bash
+-------------------+-----------------------------------+
| HLS Version       | Vivado HLS 2019.2.1               |
| Product family    | zynq                              |
| Target device     | xc7z020-clg484-1                  |
| Top Model Name    | top                               |
+-------------------+-----------------------------------+
| Target CP         | 10.00 ns                          |
| Estimated CP      | 8.548 ns                          |
| Latency (cycles)  | Min 11525 ; Max 11525             |
| Interval (cycles) | Min 11526 ; Max 11526             |
| Resources         | Type        Used    Total    Util |
|                   | --------  ------  -------  ------ |
|                   | BRAM_18K       0      280      0% |
|                   | DSP48E         0      220      0% |
|                   | FF           229   106400      0% |
|                   | LUT         1261    53200      2% |
+-------------------+-----------------------------------+
```

Another implementation hoists the common subexpression out of the innermost loop, but it obtains similar latency with noticeably higher resource usage, especially LUTs and FFs.

void top(
  ap_uint<6> v0[4][8][8][1],
  ap_uint<6> v1[16][3][3][1],
  ap_int<32> v2[4][6][6][16]
) {	//
  l_B_n: for (int n = 0; n < 4; n++) {	//
    l_h: for (int h = 0; h < 6; h++) {	//
      l_w: for (int w = 0; w < 6; w++) {	//
        l_c: for (int c = 0; c < 16; c++) {	//
        #pragma HLS pipeline II=1
          ap_int<32> sum_rv;	//
          sum_rv = 0;	//
          l_rx_1: for (int rx_1 = 0; rx_1 < 3; rx_1++) {	//
            l_rx_2: for (int rx_2 = 0; rx_2 < 3; rx_2++) {	//
              l_rx_0: for (int rx_0 = 0; rx_0 < 1; rx_0++) {	//
                ap_uint<6> v12 = v0[n][(h + rx_1)][(w + rx_2)][rx_0];	//
                ap_uint<6> v13 = v1[c][rx_1][rx_2][rx_0];	//
                ap_uint<6> v14 = v12 ^ v13;	//
                ap_int<32> popcount_rv;  //
                popcount_rv = 0;  //
                l_rx_3: for (int rx_3 = 0; rx_3 < 6; rx_3++) {	//
                  ap_uint<1> v15 = v14[rx_3];	//
                  ap_int<32> v16 = popcount_rv;	//
                  ap_int<32> v17 = v15;	//
                  ap_int<32> v18 = v17 + v16;	//
                  popcount_rv = v18;	//
                }
                sum_rv = sum_rv + popcount_rv;	//
              }
            }
          }
          ap_int<32> v19 = sum_rv;	//
          ap_int<32> v20 = v19 << 1;	//
          ap_int<32> v21 = 54 - v20;	//
          v2[n][h][w][c] = v21;	//
        }
      }
    }
  }
}
```bash
+-------------------+-----------------------------------+
| HLS Version       | Vivado HLS 2019.2.1               |
| Product family    | zynq                              |
| Target device     | xc7z020-clg484-1                  |
| Top Model Name    | top                               |
+-------------------+-----------------------------------+
| Target CP         | 10.00 ns                          |
| Estimated CP      | 7.195 ns                          |
| Latency (cycles)  | Min 11528 ; Max 11528             |
| Interval (cycles) | Min 11528 ; Max 11528             |
| Resources         | Type        Used    Total    Util |
|                   | --------  ------  -------  ------ |
|                   | BRAM_18K       0      280      0% |
|                   | DSP48E         0      220      0% |
|                   | FF           422   106400      0% |
|                   | LUT         1964    53200      3% |
+-------------------+-----------------------------------+
```

Vivado HLS can now automatically generate a reduction tree for integer operations, so there is no need to write one ourselves.
https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/Optimizing-Logic-Expressions
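As a rough illustration (a sketch, not code from this issue), once the per-bit loop is fully unrolled, the chain of additions becomes a single integer expression that the tool's expression balancing can restructure into an adder tree:

```cpp
#include <ap_int.h>

// Sketch: with the bit loop fully unrolled, the chained += operations form
// one integer sum that HLS expression balancing can turn into a balanced tree.
ap_uint<6> popcount_unrolled(ap_uint<32> x) {
#pragma HLS inline
  ap_uint<6> cnt = 0;
  for (int i = 0; i < 32; i++) {
#pragma HLS unroll
    cnt += x[i];
  }
  return cnt;
}
```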

Actually, if the intermediate variable popcount_rv is not introduced, the latency and resource usage are exactly the same.

The problem with popcount is that the bit-reduction loop also needs to be unrolled, which may limit the pipeline depth in HLS. (HLS may not unroll loops with a total trip count larger than 1024.) Thus, a high-performance implementation of popcount is still needed, which can refer to here; this does not require IR support and can be implemented directly in the frontend.
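One well-known constant-depth formulation is the classic SWAR bit-twiddling popcount (shown here only as a sketch; whether this matches the linked reference is an assumption). It needs no loop at all, so nothing has to be unrolled, and the frontend could emit this pattern directly:

```cpp
#include <ap_int.h>

// SWAR popcount: each step sums counts held in progressively wider bit fields,
// giving 5 add levels for 32 bits and no loop-carried dependence.
ap_uint<6> popcount_swar(ap_uint<32> x) {
#pragma HLS inline
  ap_uint<32> c = x;
  c = (c & 0x55555555) + ((c >> 1)  & 0x55555555);  // 2-bit counts
  c = (c & 0x33333333) + ((c >> 2)  & 0x33333333);  // 4-bit counts
  c = (c & 0x0F0F0F0F) + ((c >> 4)  & 0x0F0F0F0F);  // 8-bit counts
  c = (c & 0x00FF00FF) + ((c >> 8)  & 0x00FF00FF);  // 16-bit counts
  c = (c & 0x0000FFFF) + ((c >> 16) & 0x0000FFFF);  // full 32-bit count
  return c;  // the result is at most 32, so it fits in 6 bits
}
```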