StanfordAHA/Halide-to-Hardware

Error when doing make run-clockwork on new max pooling example

dillonhuff opened this issue · 16 comments

@jeffsetter @thenextged I've gotten the CPU code compiling and running for my max-pooling example (https://github.com/StanfordAHA/Halide-to-Hardware/tree/maxpool_example/apps/hardware_benchmarks/apps/max_pool_2x2), but when I run make run-clockwork I get the following error:

dhuff@kiwi:~/h2h2clockwork/Halide-to-Hardware/apps/hardware_benchmarks/apps/max_pool_2x2$ make run-clockwork
g++-7 -std=c++17 -I../../../../../clockwork -I../../../../../clockwork/include -I/home/dhuff/h2h2clockwork/Halide-to-Hardware/clockwork/barvinok-0.41/isl -fPIC -I/home/dhuff/h2h2clockwork/clockwork/barvinok-0.41/isl/ -c bin/clockwork_codegen.cpp -o bin/clockwork_codegen.o
In file included from bin/clockwork_codegen.cpp:2:0:
bin/maxpool_memory.cpp: In function ‘prog maxpool()’:
bin/maxpool_memory.cpp:11:27: error: ‘arg_0’ was not declared in this scope
 int32_t &hw_output_s0_c = arg_0;
                           ^~~~~
../../hw_support/hardware_targets.mk:138: recipe for target 'bin/clockwork_codegen.o' failed
make: *** [bin/clockwork_codegen.o] Error 1

Any idea what is going on here? Thanks!

Let me take a look (although I think Jeff might be of more help with this error)

Hmm. I'm getting a different error. Are bin/clockwork_testscript.h and bin/clockwork_testscript.cpp generated when you run-clockwork ?

I've seen several different errors depending on the order in which I call the make commands.

Yes I see bin/clockwork_testscript.h and bin/clockwork_testscript.cpp in bin after I run make run-clockwork.

The other error I sometimes get is:

[INFO] Module.compile(): clockwork_source_name = ./bin/maxpool_clockwork.cpp
creating file from name: maxpool
[INFO] Module.compile(): clockwork header file = ./bin/maxpool.h
creating file from name: maxpool
outputting clockwork target named maxpool
Error at ../../../../distrib/tools/GenGen.cpp:4:
Can't represent an integer with this many bits in C: (void *)
../../hw_support/hardware_targets.mk:132: recipe for target 'bin/maxpool_memory.cpp' failed
make: *** [bin/maxpool_memory.cpp] Aborted (core dumped)
make: *** Deleting file 'bin/maxpool_memory.cpp'

What error are you seeing?

Ah! that's the error I'm getting. I'm tracing it now

I just pushed a temporary fix to this maxpool_example branch. can you pull this and try again?

@thenextged trying now.

@jeffsetter The _hls_target ProducerConsumer is getting inserted at the wrong place in this app. The outer loop level is outside the _hls_target PC node (which should never happen). I think the right thing to do is to make sure that the _hls_target PC node is the first child of the output function's PC node

For now, I went back to using the name hw_output as the accelerator boundary

@thenextged is there an issue with the way the app itself is written that I could fix? Or is the problem in code generation?

@thenextged just finished running and comparing max pool and they diff the same. Thank you!

Great! I think the app's Halide code is fine. The codegen fix shouldn't be too bad (I'll let Jeff comment on that)

To adjust the loops, you can use hw_output.reorder(xi, yi, c, xo, yo)
This moves the loops so that the channel loop is within the accelerator (which includes loops between xi and xo.

Patch that evaluates correctly is:

index e2238b484..390c303db 100644
--- a/apps/hardware_benchmarks/apps/max_pool_2x2/maxpool_generator.cpp
+++ b/apps/hardware_benchmarks/apps/max_pool_2x2/maxpool_generator.cpp
@@ -77,6 +77,7 @@ public:

           hw_output
             .tile(x, y, xo, yo, xi, yi, imgSize, imgSize)
+            .reorder(xi, yi, c, xo, yo)
             .hw_accelerate(xi, xo);

           pooled.compute_at(hw_output, xo);
diff --git a/src/CodeGen_RDAI.cpp b/src/CodeGen_RDAI.cpp
index f102879e5..b7e658b66 100644
--- a/src/CodeGen_RDAI.cpp
+++ b/src/CodeGen_RDAI.cpp
@@ -163,7 +163,8 @@ void CodeGen_RDAI::visit(const Call *op) {
 }

 void CodeGen_RDAI::visit(const ProducerConsumer *op) {
-    string target_prefix = "hw_output";
+  //string target_prefix = "hw_output";
+  string target_prefix = "_hls_target";
     if(starts_with(op->name, target_prefix) && op->is_producer) {
         Stmt hw_body = substitute_in_all_letstmts(op->body);

@jeffsetter both of these loop nests should be valid 1: (xi, yi, xo, yo, c) and 2: (xi, yi, c, xo, yo)) right? I think if we make sure that all For nodes in-between hw_output and _hls_target PC nodes are moved inside the _hls_target PC node, then all will be good. Otherwise, we're limiting ourselves in terms of schedules

Yea I think both loop nests should be allowed. If you want to have all of the loops done on the accelerator you can include hw_accelerate(xi, c) with reorder(xi, yi, xo, yo, c).

However, I'm not sure why all of those loops need to be in the accelerator. Is there some assumption in that loops can't be outside the hls_target

Yeah the issue is I'm not supporting scalar variables as accelerator function parameters yet (which we will need in this case to pass the current value of c if it is outside the accelerator). A better solution if for me to add support of that. But I might not get to it at least until after I get the FPGA flow to work

Or as an alternative, I can emit crops/slices and rewrite the IR to collapse the external dimensions. This way we only keep buffer arguments.
Any ideas about other/better alternatives?

Also, I want to make sure that the strategy I choose here is compatible with your target generator