Xilinx/finn

v0.10 FINN build flow fails at step_hw_ipgen due to an error in the thresholding HLS layers.

Arbiter-glitch opened this issue · 15 comments

Prerequisites

Possible bug in the current version (v0.10) of FINN.

Quick summary

The IP generation step for the thresholding HLS layers fails in the latest version of FINN (v0.10). The stack trace is included below.

[Screenshot 2024-04-23 214914: stack trace of the error in one thresholding HLS layer]

This shows the error in one layer, but when the logs of the other thresholding HLS layers were checked, the same error was found. The log files of some layers are included here.
--> ERROR: [HLS 207-814] "Bitwidth exceeds 32768 (1 << 15), the maximum allowed value" <--
logs.zip

Steps to Reproduce

  1. Clone the FINN repository main branch
  2. Start the docker daemon with the command: sudo service docker start
  3. Set the following export variables:
    export FINN_XILINX_PATH="/your/xilinx/path"
    export FINN_XILINX_VERSION=2022.2
    export PYNQ_BOARD=ZCU104
  4. Before running the docker script, I added a custom step to the build_dataflow steps. This step preprocesses images by dividing the input by 255. For this step to be part of the default build steps, add it to the build_dataflow_config.py file in the builder folder in src/finn. (A sketch of such a step is shown after this list.)
  5. Go to the cloned FINN directory and run the docker script (run-docker.sh) with build_dataflow and the path to the folder containing the model and the config file. The model file, the required config files, and the build-dataflow-steps python file with the streamlining steps used are included here.
    files.zip
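
For reference, a minimal sketch of what such a preprocessing step could look like, assuming the usual FINN build-step signature (model, cfg) -> model; the step name step_preprocess, the tensor/initializer names, and the Div-node wiring are illustrative and not taken from the attached files:

    import numpy as np
    from onnx import TensorProto, helper
    from qonnx.core.modelwrapper import ModelWrapper

    def step_preprocess(model: ModelWrapper, cfg):
        """Divide the global input by 255 by prepending a Div node (sketch)."""
        graph = model.graph
        global_in = graph.input[0].name
        # new tensor carrying the scaled input (name is illustrative)
        vi = helper.make_tensor_value_info(
            "preproc_out", TensorProto.FLOAT, model.get_tensor_shape(global_in)
        )
        graph.value_info.append(vi)
        # constant divisor 255.0
        model.set_initializer("preproc_scale", np.asarray(255.0, dtype=np.float32))
        # rewire all existing consumers of the global input first ...
        for n in graph.node:
            n.input[:] = ["preproc_out" if i == global_in else i for i in n.input]
        # ... then prepend the Div node, which still reads the global input
        div = helper.make_node("Div", [global_in, "preproc_scale"], ["preproc_out"])
        graph.node.insert(0, div)
        return model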

Expected behavior

Completion of the FINN build steps.

Actual behavior

The build fails at the IP generation step due to errors in the HLS layers, so the corresponding required IP is not generated.

Additional Info

I came across this bug because of an issue here. I tried to make it work in v0.10 by replacing the HLS thresholding layers that throw the error with RTL variants, but the expected image output was not produced. Since I was getting the expected output in v0.9, I wanted to find the layer responsible for the wrong output in v0.10 by making all layers HLS, as in v0.9, but that threw an error in the thresholding layers. Refer to the steps above to reproduce the error.

Hi @Arbiter-glitch, from the logfiles that you have provided, I see this error: ERROR: [HLS 207-2163] 'bitwidth' attribute requires integer constant between 1 and 8191 inclusive (/mnt/f/workdir/Vitis_HLS/2022.2/common/technology/autopilot/ap_common.h:520:35). This looks like you are trying to use a different datatype than expected; are you using the code generation from the FINN flow?
What does the generated .cpp code for the layer that fails look like? The one from this directory: /mnt/f/linux/test/code_gen_ipgen_Thresholding_hls_5_f33du_vs/top_Thresholding_hls_5.cpp

Hi @auphelia, I have set "test" as my build directory via the FINN_HOST_BUILD_DIR environment variable. Apart from the changes mentioned earlier, i.e. in the config file for specializing layers and in the build_dataflow file, I have made no changes anywhere else in the FINN flow. I set "test" as my build dir because I am using FINN in WSL, and the default host build dir (tmp) is inside the Linux environment, which cannot be accessed after the WSL session has ended.

I saw this particular info, INFO: [HLS 207-4518] in instantiation of template class 'ap_uint<25245>' requested here (/mnt/f/linux/test/code_gen_ipgen_Thresholding_hls_5_f33du_vs/top_Thresholding_hls_5.cpp:28:1), in the log file as well, but I am using the code generation from the FINN flow itself.

HLS has a maximum stream width limitation of 8192 bits. You could be running into this because at least one internal stream within the FINN accelerator is too wide. If this is the cause, I'm afraid you'll have to decrease the parallelism (PE or SIMD) for the offending layers or switch to the newly released RTL back-end, which shouldn't have this limitation.
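
As a side note, one possible way to steer individual layers toward the RTL back-end is sketched below. It assumes the v0.10 flow honours the preferred_impl_style node attribute during step_specialize_layers; the file names are placeholders:

    from qonnx.core.modelwrapper import ModelWrapper
    from qonnx.custom_op.registry import getCustomOp

    # load an intermediate model from before layer specialization (placeholder name)
    model = ModelWrapper("intermediate_models/step_convert_to_hw_layers.onnx")
    for node in model.graph.node:
        if node.op_type == "Thresholding":
            # ask specialize_layers to pick the RTL variant for this node
            getCustomOp(node).set_nodeattr("preferred_impl_style", "rtl")
    model.save("thresholding_rtl.onnx")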

It's just weird that this was working in v0.9. Maybe some change caused incorrect folding of the thresholding layers?

I used the same folding factors that I used in v0.9. I tried replacing the layers that threw the error with RTL layers in v0.10; it produced the synthesized bitfile, but not the correct output. In v0.9, using the old (HLS) layers, I got the correct output, even though each thresholding layer used a lot of resources (around 13k LUTs each). I suspected that something might be wrong with the thresholding RTL layers in v0.10 because they use far fewer resources (around 800 LUTs each) compared to the old layers in v0.9.

But it's also entirely possible that in v0.10 the incorrect output is not due to the thresholding (RTL) layer but to some other layer.

[Update]:

To find out why the output is different in FINN v0.10, I took FINN v0.9 and started replacing each of the old layers with the new RTL layers, checking the output at each stage. Only after replacing the old thresholding HLS layers with RTL layers did the output deviate from the expected one and become the white one again. So the problem might be with the thresholding RTL layers.

Thanks @Arbiter-glitch for looking closer into this!
Could you provide an isolated example? So, with which settings of the Thresholding layer do you see a difference between HLS and RTL?
Maybe using the Thresholding test case can help with that: https://github.com/Xilinx/finn/blob/dev/tests/fpgadataflow/test_fpgadataflow_thresholding.py#L141

I have replaced all the thresholding HLS layers (mem_mode "const") of v0.9 with v0.10 RTL layers. The attributes of the different thresholding RTL layers are given in the images below.

[Images th1, th2, th3: node attributes of the three thresholding RTL layers]

For image 1 in v0.10

n_inp_vecs = [1, 40, 40]

Function calls

  1. test_fpgadataflow_thresholding(impl_style="hls", idt=DataType["INT24"], act=DataType["UINT8"], nf=4, ich=12, exec_mode="rtlsim", mem_mode ="internal_embedded")

Result:

Too many warnings. Cannot list them all.

  1. test_fpgadataflow_thresholding(impl_style="hls", idt=DataType["INT24"], act=DataType["UINT8"], nf=4, ich=12, exec_mode="rtlsim", mem_mode ="internal_decoupled")

Result:

INFO: [HLS 207-4518] in instantiation of template class 'ap_uint<18360>' requested here (/mnt/f/linux/test5/code_gen_ipgen_Thresholding_hls_0_7gbhs8ae/top_Thresholding_hls_0.cpp:28:1)
ERROR: [HLS 207-2163] 'bitwidth' attribute requires integer constant between 1 and 8191 inclusive

  1. test_fpgadataflow_thresholding(impl_style="rtl", idt=DataType["INT24"], act=DataType["UINT8"], nf=4, ich=12, exec_mode="rtlsim", mem_mode ="")

Result:

A bunch of warnings and an assertion error, assert np.isclose(exp_cycles, cycles_rtlsim, atol=15), which seems unrelated to the variation at the output. But the final output went back to the white-colored one as soon as I used the thresholding RTL layers in v0.9, and the thresholding RTL layers in v0.10 also give the same white output.

                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  132 |  localparam int unsigned  MAX_PENDING = (DEEP_PIPELINE+1)*N + 3;
      |                                                       ^
                ... For warning description see https://verilator.org/warn/WIDTH?v=4.224
                ... Use "/* verilator lint_off WIDTH */" and lint_on around source to disable this message.
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:220:25: Logical operator LOGAND expects 1 bit on the LHS, but LHS's VARREF 'DEPTH_TRIGGER_URAM' generates 32 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  220 |      DEPTH_TRIGGER_URAM && (DEPTH >= DEPTH_TRIGGER_URAM)? "ultra" :
      |                         ^~
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:221:25: Logical operator LOGAND expects 1 bit on the LHS, but LHS's VARREF 'DEPTH_TRIGGER_BRAM' generates 32 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  221 |      DEPTH_TRIGGER_BRAM && (DEPTH >= DEPTH_TRIGGER_BRAM)? "block" :
      |                         ^~
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:223:25: Logical operator LOGAND expects 1 bit on the LHS, but LHS's VARREF 'DEPTH_TRIGGER_BRAM' generates 32 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  223 |      DEPTH_TRIGGER_BRAM && (DEPTH >= 64)? "distributed" : "auto";
      |                         ^~
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:137:63: Operator ASSIGN expects 6 bits on the Assign RHS, but Assign RHS's SUB generates 32 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  137 |   logic signed [$clog2(MAX_PENDING):0]  GuardSem = MAX_PENDING-1;  
      |                                                               ^
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:140:22: Operator ASSIGNDLY expects 6 bits on the Assign RHS, but Assign RHS's SUB generates 32 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  140 |    if(rst)  GuardSem <= MAX_PENDING-1;
      |                      ^~
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:213:29: Operator EQ expects 32 or 2 bits on the LHS, but LHS's SEL generates 1 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  213 |    uwire  cs = (p.ptr[SN:0] == 2**SN-1);
      |                             ^~
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:344:17: Operator LTS expects 32 bits on the LHS, but LHS's VARREF 'APtr' generates 6 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  344 |     assert(APtr < $signed(A_DEPTH-1)) else begin
      |                 ^
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:362:17: Bit extraction of array[17:0] requires 5 bit index, not 6 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  362 |     BDat <= ADat[APtr];
      |                 ^
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:369:31: Operator ADD expects 32 bits on the LHS, but LHS's SEL generates 8 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  369 |    assign odat[pe] = BDat[pe] + BIAS;
      |                               ^
%Warning-WIDTH: /mnt/f/linux/test5/code_gen_ipgen_Thresholding_rtl_0_3dey3ahz/thresholding.sv:369:20: Operator ASSIGNW expects 8 bits on the Assign RHS, but Assign RHS's ADD generates 32 bits.
                                                                                                    : ... In instance Thresholding_rtl_0_axi_wrapper.core.impl
  369 |    assign odat[pe] = BDat[pe] + BIAS;
      |                    ^ 

Note: I tried the test in v0.9 as well. The HLS layers that worked and gave correct output used mem_mode "const", even though there were many warnings. But for mem_mode "decoupled", it gave the same bitwidth error in v0.9.

[Update]:

In light of the above tests, I tried FINN v0.10 with Thresholding HLS and mem_mode="internal_embedded", and it works.
The problem is with mem_mode="internal_decoupled", just like in the tests. As for the thresholding RTL layers: even though the tests do not show any variation in output, my final output varies considerably.

I have posted the following images to show the variation in output due to the thresholding RTL layers. The first image is obtained with all thresholding HLS layers and mem_mode=internal_embedded. The PYNQ board that I am targeting is the ZCU104 evaluation board. If you zoom in on the images obtained from the FPGA, you can see square patches, since I am processing the image patch by patch and then joining the patches together. The second image is obtained after replacing one HLS layer with a thresholding RTL layer. The ONNX model image is also given.

All thresholding HLS layers

[Image numpy_ov10dis: output with all thresholding HLS layers]

One RTL thresholding layer

[Image numpy_o1rtl: output after replacing one HLS layer with an RTL thresholding layer]

Corresponding ONNX model image [from step_specialize_layers]

[Image: ONNX model after step_specialize_layers]

Hi @Arbiter-glitch ,
We just pushed a fix for the RTL Thresholding layer; could you please check whether it solves your problem?

Hi @auphelia, I checked it with the dev branch, but it has not solved the problem.
However, there is a difference from the earlier all-white output.

Earlier bad output

[Image numpy_ov9rtlthres: earlier bad output]

Current bad output

[Image numpy_obug2: current bad output]

The expected correct output, taken from thresholding HLS layers (internal_embedded)

[Image numpy_ofinal: expected correct output]

Hi @Arbiter-glitch,

You are dealing with quite a high bit width (for QNNs on FPGAs), and that causes (at least in the HLS Thresholding) the error in the HLS synthesis. I reproduced it according to your settings above; to represent the generated thresholds, you need the data type INT24 (like the input). When you use the internal_decoupled mode, the threshold memory is not part of the HLS code (see here: https://finn.readthedocs.io/en/latest/internals.html#hls-variant-of-matrixvectoractivation-mem-mode) and you get a streaming interface whose width is determined by the setting of PE, the threshold bit width (INT24) and the number of steps (coming from the output activation). You can find the code here: https://github.com/Xilinx/finn/blob/dev/src/finn/custom_op/fpgadataflow/hls/thresholding_hls.py#L117
This results in a weight stream width of 24480 (hls::stream<ap_uint<24480>> &weights_V) in the generated top_Thresholding_hls_0.cpp file, which exceeds the allowed 8191 bits (a limitation of Vitis HLS). Using internal_embedded with HLS solves the problem, as you mentioned above.
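
To make the arithmetic explicit, here is a small sanity check. It assumes the weight stream width is PE x threshold bit width x number of threshold steps (as in the linked code); the PE values are inferred from the reported widths:

    # back-of-the-envelope check of the reported stream widths (assumptions noted)
    t_bits = 24              # INT24 threshold data type
    n_steps = 2**8 - 1       # UINT8 output activation -> 255 threshold steps
    print(4 * t_bits * n_steps)  # PE=4: 24480 bits, as in top_Thresholding_hls_0.cpp
    print(3 * t_bits * n_steps)  # PE=3 (ich=12, nf=4): 18360 bits, as in the test log
    # both far exceed the 8191-bit stream width limit enforced by Vitis HLS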

Regarding the RTL Thresholding:
Could you please check again whether you can provide a test case for the RTL Thresholding that shows the incorrect behaviour? You could isolate a specific node from your network.

For debugging purposes, you could execute the model either right after streamlining or after the conversion to HW layers with return_full_exec_context=True, and do the same with node-by-node rtlsim after specializing the layers. A dictionary will then be returned with every intermediate tensor, so you can better compare different network settings.
There is a verification notebook that can help with setting up the execution manually; you can use the onnx files from the intermediate_models folder.
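
A minimal sketch of that execution flow, assuming a model file name from the intermediate_models folder and a random test input (both placeholders):

    import numpy as np
    from qonnx.core.modelwrapper import ModelWrapper
    from finn.core.onnx_exec import execute_onnx

    # placeholder file name; use the matching file from intermediate_models
    model = ModelWrapper("intermediate_models/step_convert_to_hw_layers.onnx")
    # derive the input name from the model rather than assuming "global_in"
    iname = model.graph.input[0].name
    ishape = model.get_tensor_shape(iname)
    input_dict = {iname: np.random.rand(*ishape).astype(np.float32)}

    # with return_full_exec_context=True, every intermediate tensor is
    # returned, not just the graph outputs, so two runs can be compared node by node
    ctx = execute_onnx(model, input_dict, return_full_exec_context=True)
    for tname, tval in ctx.items():
        print(tname, tval.shape)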

I executed the model after convert_to_hw layers and compared the output tensors according to the verification notebook. The golden output was the output from the all-thresholding-HLS-layer network. Both results were the same for convert_to_hw layers.
But in node-by-node rtlsim, the output of the network model with thresholding RTL was different from that of the thresholding HLS layers. I tried to get the dictionary values of the intermediate tensors for rtlsim by using return_full_exec_context=True in the onnx execution, but all the displayed intermediate tensors were all zeros except for the final global_out. I tried using finn.core.onnx_exec.execute_onnx_and_make_model(model, input_dict) to get the intermediate tensors in visual form in Netron, but it shows the same. This execute_onnx was done to get the tensor values from the thresholding HLS network for comparison with the wrong-output thresholding RTL network. Also, giving start_node and end_node to execute_onnx did not stop execution at the end_node.
Check the zip file for the model file with intermediate tensor values as well as the verification notebook ipynb file.
Verification.zip

Hi @Arbiter-glitch,

I know it's getting a bit into trial & error territory, but could you try your example with the latest dev again?
In the previous RTL Thresholding implementation, we had the requirement that input data type = threshold data type.
We have now implemented the option to set these data types independently of each other. I can see in one of the images you shared above that your threshold nodes have different values for the input data type and the threshold data type, so the error could be coming from this.

Regarding your trial with the execution flow:
The execution functions in FINN are based on tensor names. An error that often happens is that a certain tensor name is assumed, e.g. "global_in", but the actual name of the tensor is different. You can prevent this by not using a fixed string and instead using e.g. model.graph.input[0].name in the input_dict.

Hi @auphelia
Thanks... it's working now. Could you tell me whether it's possible to split a 3-channel tensor into three separate single-channel tensors and process them separately in separate networks using FINN? When a split node is introduced, I don't know how to convert it. How do I introduce a fork in an ONNX model? I found a Concat layer but no Split.

Thanks @Arbiter-glitch,
Glad to hear that it's working with the RTL Thresholding now! Your project looks interesting; if you like, you could make a post in the Show & Tell section about it to share with other users.

Could you post your next question in GitHub Discussions, please? I will close this issue and the related issue #1055 and mark it as solved by PR #1077.
Keeping the discussions separate makes the repo easier to search for other users with similar questions/issues.