Issue with systolic array based large matrix multiplication

Question

Issue with systolic array based large matrix multiplication

Opened this issue 4 years ago · 7 comments

Hi !

I am deploying the heteroCL on the Vitis platform and running the systolic array based matrix multiplication sample systolic_array_vitis.py.
Here are three issues I came across, wondering whether we can possibly ask for some suggestions?

In the original kernel implementation in the systolic_array_vitis.py , the dot product result in the systolic array implementation seems to be inconsistent with the classical matrix multiplication.
The generated Vivado HLS C++ kernel code is shown below, where the localB matrix has the danger of index out of range.

 bit32 localB[15][16];
   for (bit32 args = 0; args < 15; ++args) {
     for (bit32 args0 = 0; args0 < 16; ++args0) {
       localB[args][args0] = 0;
     }
   }
   for (bit32 args1 = 0; args1 < 16; ++args1) {
     for (bit32 args01 = 0; args01 < 16; ++args01) {
       output[args1][args01] = 0;
     }
   }
   bit32 localA[16][15];
   for (bit32 args2 = 0; args2 < 16; ++args2) {
     for (bit32 args02 = 0; args02 < 15; ++args02) {
       localA[args2][args02] = 0;
     }
   }
   bit32 update;
   for (bit32 k = 0; k < 16; ++k) {
   #pragma HLS pipeline
     for (bit32 y = 0; y < 16; ++y) {
       for (bit32 x = 0; x < 16; ++x) {
         bit32 temp_localA;
         temp_localA = ((0 < x) ? ((bit32)localA[y][(x + -1)]) : ((bit32)A[y][k]));
         bit32 temp_localB;
         temp_localB = ((0 < y) ? ((bit32)localB[(y + -1)][x]) : ((bit32)B[k][x]));
         output[y][x] = ((bit32)(((ap_int<65>)((k == 0) ? ((bit32)0) : ((bit32)output[y][x]))) + ((ap_int<65>)(((ap_int<64>)temp_localA) * ((ap_int<64>)temp_localB)))));
         localA[((x / 15) + y)][(x % 15)] = temp_localA;
         localB[y][x] = temp_localB;
       }
     }
   }

After changing the dimension of the local matrix to be consistent with the input matrix, the inconsistency is solved.

def kernel(A, B):
       localA = hcl.compute((m, k), lambda *args: 0, "localA")
       localB = hcl.compute((k, n), lambda *args: 0, "localB")
       output = hcl.compute((m, n), lambda *args: 0, "output")
       def update(k, y, x):
 
           localA[y, x] = hcl.select(x>0, localA[y, x-1], A[y, k])
           localB[y, x] = hcl.select(y>0, localB[y-1, x], B[k, x])
           output[y, x] = hcl.select(k==0, 0, output[y, x]) + localA[y, x] * localB[y, x]
 
       hcl.mutate((m, dim_y, dim_x),
           lambda k, y, x: update(k, y, x), name="update")
       return output

When trying to run with 1024x1024 matrix multiplication, the output is an zero matrix,

wondering whether it is because of the lack of multipliers in the FPGA? The FPGA used in our AWS F1 instance is VU9P which contains 6840 DSP slices, 75.9 MB BRAMs and 270 MB URAMs.
can we ask for some suggestions on the larger matrix multiplication (i.e. N=5000)? can we achieve that with the systolic array implementation in heteroCL?

Referring to #228 , after setting compile ="vivado_hls" , the synthesis process can be finished otherwise vitis platform info missing is reported.

Answer 1 · 2020-08-28T01:59:11.000Z

Hi jessewjx,

The version you are using does not really work on large matrix. For 1024x1024 GEMM, It tries to implement a systolic array with 1024x1024 PEs, which is impossible to be implemented on a single FPGA.

We have a new version that tiles the matrix into smaller pieces and performs computation one by one -- e.g. it tiles a 1024x1024 matrix into a group of 32x32 matrices, passes these sub-matrices to a 32x32 systolic array, and finally sums the results up. I am still testing it and will upload the new version within a few days. Will keep you posted.
Please make sure v++ is on the PATH by running which v++. You should set up the XRT and Vitis environment before using HeteroCL's vitis backend.

Answer 2 · 2020-08-28T02:25:15.000Z

Hi Hecmay,

Thank you for the reply and looking forward to the the systolic array based tiled matrix multiplication implementation! Also, we are wondering whether it is possible to achieve 4096x4096 matrix vector multiplication with the implementation you mentioned? thanks!

Answer 3 · 2020-08-28T02:46:37.000Z

Yeah. You can parameterize the algorithm as matrix A (4096x4096) multiplied with B (4096x1).

Answer 4 · 2020-08-28T03:16:21.000Z

Great! that is very inspiring and looking forward to the release of the new version implementation of systolic array based matrix multiplication!

Answer 5 · 2020-09-14T03:25:04.000Z

Hi @jessewjx. Please see the systolic gemm here: https://github.com/Hecmay/heterocl/blob/fix/samples/systolic_array/systolic_array_module_stream.py

It may take a long time for this PR to be merged. You can simply pull back and compile the fix branch to give it a try: https://github.com/Hecmay/heterocl/blob/fix/

Answer 6 · 2020-09-14T19:32:13.000Z

Thank you @hecmay and we will have a try with the systolic gemm!

Answer 7 · 2020-09-26T13:59:50.000Z

Hi @hecmay, I tried to run the systolic gemm on the Vitis platform. Probably due to some setup issue, “vitis platform information missing ” occurred. Please note that the vitis environment is already set up on the server.

I then generated Vivado_HLS C++ code by setting the target to “vivado_hls” and ran the C simulation. I tested the systolic array implementation with the matrix vector multiplication with matrix size=32x32 and systolic size=4 .

The resulting vector does not agree with the simple matrix vector product.
My hunch is that this could have happened because I was not executing the program correctly or that there is a minor bug in the current implementation.

Could you advice on this issue? It would be great if you could give some rough guides on how to execute the program as well.
Thanks for writing this nice piece of software, hoping to hear from you soon.