cornell-zhang/hcl-dialect

[Op] Support popcount in hardware

Closed this issue · 4 comments

Population count (popcount) is a very important hardware intrinsic, but it is still not supported by HeteroCL. Currently we need to write popcount explicitly as a reduction over every bit. An example is shown below.

import heterocl as hcl

def test_popcount():
    hcl.init(hcl.Int(32))
    A = hcl.placeholder((3,))
    B = hcl.placeholder((3,))
    def conv(A, B):
        rb = hcl.reduce_axis(0, 32)
        B = hcl.compute((3,), lambda i:
            hcl.sum((A[i] + B[i])[rb], axis=[rb], dtype=hcl.Int(32)), name="B", dtype=hcl.Int(32))
        return B
    s = hcl.create_schedule([A, B], conv)
    print(hcl.lower(s))

It generates the following MLIR code.

module {
  func @top(%arg0: memref<3xi32>, %arg1: memref<3xi32>) -> memref<3xi32> attributes {bit, extra_itypes = "ss", extra_otypes = "s"} {
    %0 = memref.alloc() {name = "B"} : memref<3xi32>
    affine.for %arg2 = 0 to 3 {
      %1 = memref.alloc() {name = "sum_rv"} : memref<1xi32>
      %c0 = arith.constant 0 : index
      %c0_i32 = arith.constant 0 : i32
      affine.store %c0_i32, %1[%c0] {to = "sum_rv"} : memref<1xi32>
      affine.for %arg3 = 0 to 32 {
        %3 = affine.load %arg0[%arg2] {from = "compute_0"} : memref<3xi32>
        %4 = affine.load %arg1[%arg2] {from = "compute_1"} : memref<3xi32>
        %5 = arith.addi %3, %4 : i32
        %6 = hcl.get_bit(%5 : i32, %arg3) -> i1
        %7 = affine.load %1[%c0] {from = "sum_rv"} : memref<1xi32>
        %8 = arith.extsi %6 : i1 to i32
        %9 = arith.addi %8, %7 : i32
        affine.store %9, %1[%c0] {to = "sum_rv"} : memref<1xi32>
      } {loop_name = "rx_0", reduction}
      %c0_0 = arith.constant 0 : index
      %2 = affine.load %1[%c0_0] {from = "sum_rv"} : memref<1xi32>
      affine.store %2, %0[%arg2] {to = "B"} : memref<3xi32>
    } {loop_name = "i", stage_name = "B"}
    return %0 : memref<3xi32>
  }
}

But this incurs many redundant memory loads and much redundant computation in the innermost reduction loop. Actually, A[i]+B[i] can be hoisted out of the innermost loop, and popcount can be implemented as a reduction tree. Otherwise, there is a loop-carried dependence on the reduction variable, which causes low performance on FPGA.
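A minimal sketch of the intended structure (a hypothetical `popcount_tree` helper, not code from HeteroCL): the addition is computed once outside the bit loop, and the per-bit sum is organized as a balanced reduction tree, so there is no sequential chain on the reduction variable.

```cpp
#include <ap_int.h>

// Balanced-tree popcount: log2(32) = 5 levels of pairwise adds,
// with no loop-carried dependence on a single accumulator.
ap_int<32> popcount_tree(ap_uint<32> x) {
#pragma HLS inline
  ap_uint<6> partial[32];  // 6 bits are enough to hold counts up to 32
#pragma HLS array_partition variable=partial complete
  for (int i = 0; i < 32; i++) {  // level 0: one partial count per bit
#pragma HLS unroll
    partial[i] = x[i];
  }
  for (int step = 1; step < 32; step *= 2) {  // levels 1..5: pairwise sums
#pragma HLS unroll
    for (int i = 0; i < 32; i += 2 * step) {
#pragma HLS unroll
      partial[i] += partial[i + step];
    }
  }
  return partial[0];
}

// Hypothetical usage inside the loop over i, with the add hoisted:
//   out[i] = popcount_tree(A[i] + B[i]);
```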

After doing some experiments with Vivado HLS, I am not sure whether a dedicated popcount implementation is needed. In the following binary convolution examples, the latency is similar, but the resource usage differs a lot.

void top(
  ap_uint<6> v0[4][8][8][1],
  ap_uint<6> v1[16][3][3][1],
  ap_int<32> v2[4][6][6][16]
) {	//
  l_B_n: for (int n = 0; n < 4; n++) {	//
    l_h: for (int h = 0; h < 6; h++) {	//
      l_w: for (int w = 0; w < 6; w++) {	//
        l_c: for (int c = 0; c < 16; c++) {	//
        #pragma HLS pipeline II=1
          ap_int<32> sum_rv;	//
          sum_rv = 0;	//
          l_rx_1: for (int rx_1 = 0; rx_1 < 3; rx_1++) {	//
            l_rx_2: for (int rx_2 = 0; rx_2 < 3; rx_2++) {	//
              l_rx_0: for (int rx_0 = 0; rx_0 < 1; rx_0++) {	//
                l_rx_3: for (int rx_3 = 0; rx_3 < 6; rx_3++) {	//
                  ap_uint<6> v12 = v0[n][(h + rx_1)][(w + rx_2)][rx_0];	//
                  ap_uint<6> v13 = v1[c][rx_1][rx_2][rx_0];	//
                  ap_uint<6> v14 = v12 ^ v13;	//
                  ap_uint<1> v15 = v14[rx_3];	//
                  ap_int<32> v16 = sum_rv;	//
                  ap_int<32> v17 = v15;	//
                  ap_int<32> v18 = v17 + v16;	//
                  sum_rv = v18;	//
                }
              }
            }
          }
          ap_int<32> v19 = sum_rv;	//
          ap_int<32> v20 = v19 << 1;	//
          ap_int<32> v21 = 54 - v20;	//
          v2[n][h][w][c] = v21;	//
        }
      }
    }
  }
}

This version treats popcount as the innermost reduction loop and gets the following results (II=5).

```bash
+-------------------+-----------------------------------+
| HLS Version       | Vivado HLS 2019.2.1               |
| Product family    | zynq                              |
| Target device     | xc7z020-clg484-1                  |
| Top Model Name    | top                               |
+-------------------+-----------------------------------+
| Target CP         | 10.00 ns                          |
| Estimated CP      | 8.548 ns                          |
| Latency (cycles)  | Min 11525 ; Max 11525             |
| Interval (cycles) | Min 11526 ; Max 11526             |
| Resources         | Type        Used    Total    Util |
|                   | --------  ------  -------  ------ |
|                   | BRAM_18K       0      280      0% |
|                   | DSP48E         0      220      0% |
|                   | FF           229   106400      0% |
|                   | LUT         1261    53200      2% |
+-------------------+-----------------------------------+
```

Another implementation hoists the common subexpression out of the innermost loop, but it obtains similar latency with noticeably higher resource usage, especially LUTs and FFs.

void top(
  ap_uint<6> v0[4][8][8][1],
  ap_uint<6> v1[16][3][3][1],
  ap_int<32> v2[4][6][6][16]
) {	//
  l_B_n: for (int n = 0; n < 4; n++) {	//
    l_h: for (int h = 0; h < 6; h++) {	//
      l_w: for (int w = 0; w < 6; w++) {	//
        l_c: for (int c = 0; c < 16; c++) {	//
        #pragma HLS pipeline II=1
          ap_int<32> sum_rv;	//
          sum_rv = 0;	//
          l_rx_1: for (int rx_1 = 0; rx_1 < 3; rx_1++) {	//
            l_rx_2: for (int rx_2 = 0; rx_2 < 3; rx_2++) {	//
              l_rx_0: for (int rx_0 = 0; rx_0 < 1; rx_0++) {	//
                ap_uint<6> v12 = v0[n][(h + rx_1)][(w + rx_2)][rx_0];	//
                ap_uint<6> v13 = v1[c][rx_1][rx_2][rx_0];	//
                ap_uint<6> v14 = v12 ^ v13;	//
                ap_int<32> popcount_rv;  //
                popcount_rv = 0;  //
                l_rx_3: for (int rx_3 = 0; rx_3 < 6; rx_3++) {	//
                  ap_uint<1> v15 = v14[rx_3];	//
                  ap_int<32> v16 = popcount_rv;	//
                  ap_int<32> v17 = v15;	//
                  ap_int<32> v18 = v17 + v16;	//
                  popcount_rv = v18;	//
                }
                sum_rv = sum_rv + popcount_rv;	//
              }
            }
          }
          ap_int<32> v19 = sum_rv;	//
          ap_int<32> v20 = v19 << 1;	//
          ap_int<32> v21 = 54 - v20;	//
          v2[n][h][w][c] = v21;	//
        }
      }
    }
  }
}
```bash
+-------------------+-----------------------------------+
| HLS Version       | Vivado HLS 2019.2.1               |
| Product family    | zynq                              |
| Target device     | xc7z020-clg484-1                  |
| Top Model Name    | top                               |
+-------------------+-----------------------------------+
| Target CP         | 10.00 ns                          |
| Estimated CP      | 7.195 ns                          |
| Latency (cycles)  | Min 11528 ; Max 11528             |
| Interval (cycles) | Min 11528 ; Max 11528             |
| Resources         | Type        Used    Total    Util |
|                   | --------  ------  -------  ------ |
|                   | BRAM_18K       0      280      0% |
|                   | DSP48E         0      220      0% |
|                   | FF           422   106400      0% |
|                   | LUT         1964    53200      3% |
+-------------------+-----------------------------------+
```

Vivado HLS can now automatically generate a reduction tree for integer operations, so there is no need to write one ourselves.
https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/Optimizing-Logic-Expressions
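As a rough illustration (a sketch, not code from this issue), once the per-bit loop is fully unrolled, the chain of additions becomes a single integer expression that the tool's expression balancing can restructure into an adder tree:

```cpp
#include <ap_int.h>

// Sketch: with the bit loop fully unrolled, the chained += operations form
// one integer sum that HLS expression balancing can turn into a balanced tree.
ap_uint<6> popcount_unrolled(ap_uint<32> x) {
#pragma HLS inline
  ap_uint<6> cnt = 0;
  for (int i = 0; i < 32; i++) {
#pragma HLS unroll
    cnt += x[i];
  }
  return cnt;
}
```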

Actually, if the intermediate variable popcount_rv is not introduced, the latency and resource usage are exactly the same.

The problem with popcount is that the bit-reduction loop also needs to be unrolled, which may limit the pipeline depth in HLS. (HLS may not unroll loops with a total trip count larger than 1024.) Thus, a high-performance implementation of popcount is still needed, which can refer to here; this does not require IR support and can be implemented directly in the frontend.
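One well-known constant-depth formulation is the classic SWAR bit-twiddling popcount (shown here only as a sketch; whether this matches the linked reference is an assumption). It needs no loop at all, so nothing has to be unrolled, and the frontend could emit this pattern directly:

```cpp
#include <ap_int.h>

// SWAR popcount: each step sums counts held in progressively wider bit fields,
// giving 5 add levels for 32 bits and no loop-carried dependence.
ap_uint<6> popcount_swar(ap_uint<32> x) {
#pragma HLS inline
  ap_uint<32> c = x;
  c = (c & 0x55555555) + ((c >> 1)  & 0x55555555);  // 2-bit counts
  c = (c & 0x33333333) + ((c >> 2)  & 0x33333333);  // 4-bit counts
  c = (c & 0x0F0F0F0F) + ((c >> 4)  & 0x0F0F0F0F);  // 8-bit counts
  c = (c & 0x00FF00FF) + ((c >> 8)  & 0x00FF00FF);  // 16-bit counts
  c = (c & 0x0000FFFF) + ((c >> 16) & 0x0000FFFF);  // full 32-bit count
  return c;  // the result is at most 32, so it fits in 6 bits
}
```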