Using the CFU unit as storage
bala122 opened this issue · 3 comments
Hi @tcal-x and @mithro ,
In my network/model, I've noticed a high memory bottleneck relative to computation time, so I tried using the (single-cycle) CFU unit for storage, i.e. as extra buffers for the VexRiscv core, hoping to show a speedup. Surprisingly, I'm seeing either the same runtimes or actually worse. Could this be due to how long the core takes to interface with the CFU? Any advice on this, or on alternatives such as a data scratchpad memory? I've tried changing the cache sizes and the L2 size, and maxed out the I-cache as well, but I'm not seeing improvements.
Another odd thing I've seen: when I remove the checks that test whether the point is inside the image (in conv.h), I see a really large reduction in runtime. The check from the original conv.h file is shown below:
const bool is_point_inside_image =
    (in_x >= 0) && (in_x < input_width) &&
    (in_y >= 0) && (in_y < input_height);
if (!is_point_inside_image) {
  continue;
}
Please let me know about this,
Thanks,
Bala.
@bala122 , we also used CFU storage as an isolated function as an intermediate step in some of our accelerators. For example, we'd push the data in, then in our main processing loop, the CPU would pull a data word out of CFU storage for processing either on the CPU or by using a different CFU function. As you saw, this usually did not gain a speedup. I don't think we dived into waveform-level analysis, but that would be a useful thing to do here. In this case, the Renode (with Verilator) simulation is not so useful since the CPU modeling is not cycle accurate. Here, you would want to do full Verilator simulation (including the CPU). The build is a single step:
make PLATFORM=sim load <other options>
Don't set TARGET; it is ignored. But you can set --cpu-variant using EXTRA_LITEX_ARGS.
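The CFU-as-storage push/pull pattern described earlier can be sketched as follows. This is a hedged illustration: `cfu_store`/`cfu_load` are hypothetical names standing in for the project's CFU opcode macros (on real hardware each would be a single custom instruction); here they are plain functions so the pattern is visible and runnable on a host.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical software model of a small single-cycle CFU register file.
// On hardware, cfu_store/cfu_load would each be one custom instruction.
static uint32_t cfu_regs[256];

static inline void cfu_store(uint32_t addr, uint32_t value) {
  cfu_regs[addr & 0xFF] = value;  // stand-in for a "store" CFU opcode
}

static inline uint32_t cfu_load(uint32_t addr) {
  return cfu_regs[addr & 0xFF];  // stand-in for a "load" CFU opcode
}

// Push a buffer into CFU storage, then pull words back out during the
// main processing loop (here, a simple sum stands in for real work).
uint32_t sum_via_cfu(const uint32_t* data, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    cfu_store(static_cast<uint32_t>(i), data[i]);
  }
  uint32_t acc = 0;
  for (size_t i = 0; i < n; ++i) {
    acc += cfu_load(static_cast<uint32_t>(i));
  }
  return acc;
}
```

Note that each `cfu_load` still costs at least one CPU instruction plus the CPU-to-CFU interface latency per word, which is one plausible reason this pattern rarely beats a D-cache hit.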
The CFU-Playground software, when compiled for simulation, has a menu item to enable tracing. So, if you have your small benchmark added to the project menu, then start the simulation, enable tracing, run your benchmark, and quit as quickly as possible to keep the trace file size reasonable. Note there's a bug: it produces .vcd files even though it should be producing .fst files.
Finally, that computation that you show is very familiar to me --- that was the very first code that I attempted to speed up using a CFU!! I converted the sequence of comparisons to a single-cycle CFU instruction. And I did get more improvement than I expected, given that it's not even in the inner-most loop. The sequence of conditional branches seems to be quite expensive. However, you may be seeing another effect. Any time the software is changed, you may indirectly cause an increase or decrease in I-cache misses due to changes in code placement.
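As a software-only point of comparison (not the CFU instruction described above), the four comparisons can also be collapsed with the classic unsigned-compare trick, which removes half the branches. This sketch assumes `input_width` and `input_height` are positive, as they are for any real image:

```cpp
#include <cstdint>

// For input_width > 0, (in_x >= 0 && in_x < input_width) is equivalent to
// a single unsigned comparison, because a negative in_x wraps around to a
// value far larger than any valid width. Same for the y dimension.
static inline bool is_point_inside_image(int32_t in_x, int32_t in_y,
                                         int32_t input_width,
                                         int32_t input_height) {
  return static_cast<uint32_t>(in_x) < static_cast<uint32_t>(input_width) &&
         static_cast<uint32_t>(in_y) < static_cast<uint32_t>(input_height);
}
```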
(Also note that this check is suboptimal as written; the check in one of the dimensions can be moved to a more-outer loop.)
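A sketch of that hoisting, with assumed loop-variable names mirroring the conv.h style (counting valid points stands in for the real convolution body):

```cpp
// Instead of checking both coordinates in the innermost loop, test in_y
// once per outer iteration and skip the whole row when it is out of range.
// The stride-based index math here is an assumption for illustration.
int count_valid_points(int out_h, int out_w, int stride,
                       int input_height, int input_width) {
  int valid = 0;
  for (int out_y = 0; out_y < out_h; ++out_y) {
    const int in_y = out_y * stride;
    if (in_y < 0 || in_y >= input_height) {
      continue;  // y-dimension check hoisted out of the inner loop
    }
    for (int out_x = 0; out_x < out_w; ++out_x) {
      const int in_x = out_x * stride;
      if (in_x < 0 || in_x >= input_width) {
        continue;  // only the x-dimension is checked per inner iteration
      }
      ++valid;
    }
  }
  return valid;
}
```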
Thanks for that @tcal-x ,
Actually, I realised that the stagnation in runtimes might have been due to the small size of the layer I was dealing with. I tried smaller D-cache and L2 sizes and moved to a bigger layer, and noticed improvements. However, it's still surprising that the runtime stagnated at such a high value (millions of cycles for the given layer), even though all the data could fit in the D-cache or L2.
I thought that, fundamentally, the VexRiscv core can't do better than this, given the number of instructions in the innermost loop and the total iteration count.
Or it could be that there is still instruction traffic between L1 and L2 even though the L1 I-cache is big, which seems unlikely. That's why I wanted to know about the exclusivity/inclusivity of the L1-L2 cache hierarchy (here).
@bala122 there is an interesting potential project here -- since it's a soft CPU core, we are free to modify it to add instrumentation counters. For example, we could add counters to record the number of I-cache hits and misses, and the same for the D-cache. It would take some mucking around in the SpinalHDL Scala, but it is doable (I've already prototyped a D-cache miss-counter CSR).
Or, again, looking at waveforms might be informative and reveal unexpected stalls and such.