darklife/darkriscv

Expected data bus latency?

rdb9879 opened this issue · 1 comment

I'm testing this core in simulation using code generated by riscv32-unknown-elf-gcc with -march=rv32i and no optimizations. All my instruction and data bus accesses should have a latency of 1 clock cycle. I've annotated the assembly code with what I expect to happen, highlighted the area where I think it's failing, and put a marker at the simulation readout:

[Screenshot: annotated assembly listing and simulation waveform, with the failing load marked]

So I am expecting the opcode at 0x8f8 to read from address 0xab030000, which should return a value of 0x00000000. I can see the read being performed on the data bus at address 0xab030000, and 1 clock cycle later the read data on the bus is 0x00000000 as expected. However, by that point the incorrect value has already been loaded into a4: it looks like a4 was loaded with the read data from the very clock cycle in which 0xab030000 appeared on the address lines, which suggests the core expects the return data instantaneously.
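For reference, my data memory behaves roughly like the sketch below (the names are illustrative, not my exact testbench): the read is registered, so the read data only becomes valid one clock after the address appears on the bus.

```verilog
// Minimal sketch of a 1-cycle-latency data memory (illustrative names):
// rdata is only valid on the clock edge after addr is presented.
module sim_dmem
(
    input             clk,
    input      [31:0] addr,
    input      [31:0] wdata,
    input             we,
    input             re,
    output reg [31:0] rdata
);
    reg [31:0] mem [0:1023];

    always @(posedge clk)
    begin
        if(we) mem[addr[11:2]] <= wdata;
        if(re) rdata <= mem[addr[11:2]]; // registered read => 1 wait-state
    end
endmodule
```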

Is this a bug, or does this core only support zero latency reads?

Hi!

You are right: the DarkRISCV is designed to maximize code throughput for unrolled register-to-register code, so it can peak at an IPC of ~1 only in that special case: no branches and no load/store. You have probably already found that branches take a lot of time, basically because the core pre-fetches PC+4/PC+8 and flushes everything when a branch jumps to a different PC. The workaround in that case is to unroll loops, so that more processing is done in each iteration. In the case of load/store, there are lots of possibilities...

Basically, it depends on how the memory is defined and wired and, sometimes, on the FPGA model... If your application needs a lot of code (ROM) but only a small amount of data (RAM), it may be possible to use LUTRAM for data and run load/store with 0 wait-states. A typical configuration uses BRAM for both code and data, in which case you need 1 wait-state for loads and 0 wait-states for stores (the reference DarkSOCV works this way). And some special cases may not allow individual byte writes in the BRAM, so a read/modify/write operation is needed to write individual bytes, resulting in slow stores.
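For example, a LUTRAM-style data RAM with an asynchronous read looks roughly like the sketch below (made-up names, not the actual DarkSOCV RTL); a BRAM would register the read instead, which is where the 1 wait-state for loads comes from:

```verilog
// Rough sketch of a LUTRAM-style data RAM (made-up names, not DarkSOCV code):
// the read is asynchronous, so loads can run with 0 wait-states, and per-byte
// write enables avoid a read/modify/write sequence for byte stores.
module lutram_dmem
(
    input         clk,
    input  [9:0]  addr,
    input  [31:0] wdata,
    input  [3:0]  be,      // byte write enables
    input         we,
    output [31:0] rdata
);
    reg [31:0] mem [0:1023];

    assign rdata = mem[addr]; // asynchronous read => 0 wait-states for loads

    // If the memory had no byte enables, a byte store would need a
    // read/modify/write sequence here instead, making stores slower.
    always @(posedge clk)
        if(we)
        begin
            if(be[0]) mem[addr][ 7: 0] <= wdata[ 7: 0];
            if(be[1]) mem[addr][15: 8] <= wdata[15: 8];
            if(be[2]) mem[addr][23:16] <= wdata[23:16];
            if(be[3]) mem[addr][31:24] <= wdata[31:24];
        end
endmodule
```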

An additional 4th pipeline stage may help, but only up to a point: if the system complexity grows to where the load path needs even more pipelining (because the system is too large and cannot afford too many data mux stages), additional wait-states will be needed anyway, so I avoided the 4th stage in order to keep the design as simple as possible.

But how does this impact your code? Well, it depends... typical code compiled with -O2 tries to keep all processing in registers, so there is little need for load/store operations. One possibility for increasing load/store performance is to use tricks such as burst access, FIFOs, wide buses and LUTRAM caches, so that the first load probably takes 1 extra wait-state but subsequent loads run with no wait-states.
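As a toy illustration of the caching idea (made-up names, not code from this repository), even a single-entry cache in front of a 1-wait-state BRAM lets repeated loads of the same word skip the wait-state:

```verilog
// Toy sketch: single-entry cache in front of a 1-wait-state BRAM. The first
// load of an address pays the wait-state and fills the entry; a repeated load
// of the same address hits and needs no wait-state. Store handling and the
// core-side wait-state handshake are omitted for brevity.
module one_word_cache
(
    input             clk,
    input             rd,          // load strobe from the core
    input      [31:0] addr,        // load address from the core
    input      [31:0] bram_rdata,  // BRAM data, valid 1 cycle after the address
    output            hit,         // 1 = rdata is valid now, no wait-state
    output     [31:0] rdata
);
    reg        valid = 0;
    reg [31:0] tag   = 0;
    reg [31:0] data  = 0;

    reg        fill      = 0;      // a BRAM read was issued last cycle
    reg [31:0] fill_addr = 0;

    assign hit   = rd && valid && (tag == addr);
    assign rdata = hit ? data : bram_rdata;

    always @(posedge clk)
    begin
        fill      <= rd && !hit;   // on a miss, the BRAM answers next cycle
        fill_addr <= addr;

        if(fill)                   // capture the BRAM data that just arrived
        begin
            valid <= 1;
            tag   <= fill_addr;
            data  <= bram_rdata;
        end
    end
endmodule
```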

Finally, another possibility is to use the 2-stage pipeline with a 2-phase clock, so that the top level (IO and BRAMs) operates on the positive clock edges and the core operates on the negative clock edges; the BRAM then appears to the core as a zero-latency memory. The main disadvantage of this approach is that the maximum clock is half that of the 3-stage pipeline, but at least it makes the operation very simple. In fact, the reference design DarkSOCV can operate in this mode when you comment out the 3-stage pipeline define in config.vh.
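A rough sketch of that arrangement (illustrative names, not the exact DarkSOCV top level): the BRAM works on the positive edge, the core instance receives an inverted clock and therefore works on the negative edge, and the registered read data is already stable half a cycle later, when the core samples it.

```verilog
// Sketch of the 2-phase clock idea (illustrative names, not the exact
// DarkSOCV top level): memory and IO run on the positive clock edge, the
// core runs on the negative edge, so the BRAM looks zero-latency to it.
module two_phase_top
(
    input CLK,
    input RES
);
    wire [31:0] daddr, wdata;   // driven by the core
    wire        wr, rd;         // driven by the core
    reg  [31:0] rdata;          // returned to the core

    reg [31:0] ram [0:1023];

    // memory side: positive edge, registered read (normal BRAM behaviour)
    always @(posedge CLK)
    begin
        if(wr) ram[daddr[11:2]] <= wdata;
        if(rd) rdata <= ram[daddr[11:2]];
    end

    // core side: the core instance is clocked with !CLK, so it drives
    // daddr/wdata/wr/rd and samples rdata on the negative edge, half a
    // cycle after the BRAM registered the read, i.e. with no wait-state:
    //
    //   darkriscv core ( .CLK(!CLK), .RES(RES), /* buses omitted */ );
    //
endmodule
```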

Well, I hope this info helps you solve your problem!