spcl/hls_tutorial_examples

Comparison between Example 6 and Example 7

Closed this issue · 2 comments

Hi,
Thanks for creating this great tutorial. I am confused about the comparison between Ex6 and Ex7.
Ex6: "Replication" and "Vectorization" patterns for B and A, respectively.
Ex7: "Stream" pattern for both B and A.

Following the discussion in your paper "Transformations of High-Level Synthesis Codes for High-Performance Computing", Ex.6 should suffer from a large fan-out/fan-in problem and therefore from complex routing, and the "Stream" architecture (Ex.7) is one of the solutions to this routing problem.
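
To make sure I understand the distinction, here is a rough sketch of the two connection styles (in the spirit of the examples, not the actual tutorial code; the type, `D`, and the function names are placeholders):

```cpp
// Rough sketch only: contrasts the fan-out of a replicated/vectorized compute
// stage with the point-to-point links of a streaming PE chain.
#include <hls_stream.h>
#include <ap_int.h>

typedef ap_uint<16> Data_t;
const int D = 64; // number of parallel compute units / processing elements

// Replicated/vectorized style (Ex.6-like): a single operand value feeds all D
// parallel multipliers in the same cycle, so it fans out to D destinations and
// routing gets harder as D grows.
void ComputeReplicated(Data_t a, const Data_t b[D], Data_t acc[D]) {
  #pragma HLS INLINE
  for (int pe = 0; pe < D; ++pe) {
    #pragma HLS UNROLL
    acc[pe] += a * b[pe]; // 'a' drives every multiplier directly
  }
}

// Streaming style (Ex.7-like): each PE only talks to its neighbors through
// FIFOs, so every connection stays short and point-to-point regardless of D.
void ProcessingElement(hls::stream<Data_t> &a_in, hls::stream<Data_t> &a_out,
                       Data_t b_local, Data_t &acc) {
  Data_t a = a_in.read(); // receive from the previous PE
  acc += a * b_local;
  a_out.write(a);         // forward to the next PE in the chain
}
```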

So I compared the cycle count, resource cost, and frequency of Ex.6 and Ex.7 from the Vivado HLS reports.
The cycle count of Ex.6 is lower than that of Ex.7 for all PE group sizes, which means Ex.6 exploits more parallelism than Ex.7 and matches Table 1 in the tutorial paper.
[Screenshot attached: "Screenshot from 2020-11-13 16-52-58"]

However, when I increased the target frequency, Ex.6 achieved a higher frequency (454 MHz) than Ex.7 (398 MHz), which does not match the Table 1 comparison in the tutorial paper. My experiment setting is D=64 (PE group size = 64) and W=2.
The resource consumption is also much higher in Ex.7 for all PE group sizes.
[Screenshot attached: "Screenshot from 2020-11-13 17-07-38"]

My code is modified from your Ex.6 and Ex.7 by changing float to ap_uint<16> (a sketch of this change follows the attachments):
Ex7.zip
Ex6.zip
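
The type change itself is just a typedef swap, roughly like this (the alias name `Data_t` is an assumption; my code may spell it differently):

```cpp
// Minimal sketch of the datatype change used in my experiments.
#include <ap_int.h>

// typedef float Data_t;       // type used in the original examples
typedef ap_uint<16> Data_t;    // 16-bit unsigned integer used in my modified code
```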

My environment:

  1. Vivado HLS 2018.3
  2. Board part: ZU3EG, SBVA484 package, speed grade -1, temperature grade E
  3. OS: Ubuntu 18.04

Thanks again for making this tutorial.

Hi there, I'm glad you enjoyed the tutorial!

Indeed, the major change from example 6 to example 7 is the change to a streaming architecture for the sake of routability. You will only see this effect when you try to run the placement and routing flow, so looking at the reported cycle counts and frequency estimates won't give you the full picture.
As you observed, the streaming architecture does add some resource overhead, but it allows you to reach higher overall resource utilization, and thereby higher performance.
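
For illustration, here is a minimal sketch of what the daisy-chained streaming topology looks like at the top level (not the actual code of example 7; the names and the number of PEs are placeholders):

```cpp
// Minimal sketch of a daisy-chained streaming topology under a dataflow region.
#include <hls_stream.h>
#include <ap_int.h>

typedef ap_uint<16> Data_t;

// One processing element: it only talks to its two neighbors through FIFOs.
// In a real design it would also perform its share of the computation.
void PE(hls::stream<Data_t> &in, hls::stream<Data_t> &out) {
  Data_t val = in.read();
  out.write(val);
}

// Top-level dataflow region: the PEs are wired into a chain, so every
// connection is a short point-to-point FIFO instead of a wide broadcast net.
void Top(hls::stream<Data_t> &in, hls::stream<Data_t> &out) {
  #pragma HLS DATAFLOW
  hls::stream<Data_t> pipe0("pipe0"), pipe1("pipe1");
  #pragma HLS STREAM variable=pipe0 depth=2
  #pragma HLS STREAM variable=pipe1 depth=2
  PE(in, pipe0);    // PE 0
  PE(pipe0, pipe1); // PE 1
  PE(pipe1, out);   // PE 2
}
```

With this topology, every inter-PE connection is a short FIFO, which is what relieves the routing pressure you only see once you run placement and routing.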

Have you seen our matrix multiplication repository? It contains fully optimized code:
https://github.com/spcl/gemm_hls/

Thanks, I will move on to the gemm repository.