farhanrahman/riffa

Write an HDL wrapper for the hardware core

gmingas opened this issue · 7 comments

Write a wrapper to sit between the hardware core and the RIFFA modules. It should include double buffers for sending and receiving data, together with the necessary control (state machine, etc.) to communicate with the basic software functions.
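As a rough sketch of the kind of control needed (all signal names below are made up for illustration, not the actual RIFFA ports):

```verilog
// Sketch of a wrapper control FSM: receive a chunk from RIFFA into a buffer,
// run the user core over it, then request a DMA transfer back to the PC.
// All port and signal names are illustrative, not the real RIFFA interface.
module wrapper_fsm (
    input  wire clk,
    input  wire rst,
    input  wire rx_done,       // RIFFA has finished writing input data to the buffer
    input  wire core_done,     // user core has finished processing the buffer
    input  wire tx_done,       // DMA transfer back to the PC has completed
    output reg  core_start,    // start the user core
    output reg  tx_start       // request a DMA transfer of the output buffer
);
    localparam IDLE    = 2'd0,
               COMPUTE = 2'd1,
               SEND    = 2'd2;

    reg [1:0] state;

    always @(posedge clk) begin
        if (rst) begin
            state      <= IDLE;
            core_start <= 1'b0;
            tx_start   <= 1'b0;
        end else begin
            core_start <= 1'b0;
            tx_start   <= 1'b0;
            case (state)
                IDLE:    if (rx_done)   begin core_start <= 1'b1; state <= COMPUTE; end
                COMPUTE: if (core_done) begin tx_start   <= 1'b1; state <= SEND;    end
                SEND:    if (tx_done)   state <= IDLE;
                default: state <= IDLE;
            endcase
        end
    end
endmodule
```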

You also need to accommodate multiple consecutive DMA transfers. This will probably require a software loop and some kind of hardware signal to notify the software that the loop should break (e.g. by sending a special data pattern).
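One possible way to do the break signal, as a sketch: have the hardware append a reserved sentinel word after the last real result and let the software loop stop once it sees it. The sentinel value and signal names below are assumptions:

```verilog
// Sketch of the hardware side of the break signal: when the core flags its
// final word, one extra sentinel word is emitted so the software loop knows
// no more DMA transfers will follow. LAST_PATTERN is an assumed reserved
// value that must never appear in valid results.
module sentinel_append #(
    parameter [31:0] LAST_PATTERN = 32'hFFFF_FFFF
)(
    input  wire        clk,
    input  wire        rst,
    input  wire [31:0] core_data,
    input  wire        core_valid,
    input  wire        core_last,   // core asserts this with its final word
    output reg  [31:0] tx_data,     // word written into the output buffer
    output reg         tx_valid
);
    reg send_sentinel;

    always @(posedge clk) begin
        if (rst) begin
            tx_valid      <= 1'b0;
            send_sentinel <= 1'b0;
        end else if (core_valid) begin
            tx_data       <= core_data;
            tx_valid      <= 1'b1;
            send_sentinel <= core_last;      // queue the sentinel for the next cycle
        end else if (send_sentinel) begin
            tx_data       <= LAST_PATTERN;
            tx_valid      <= 1'b1;
            send_sentinel <= 1'b0;
        end else begin
            tx_valid      <= 1'b0;
        end
    end
endmodule
```

On the software side, the receive loop would then simply keep requesting transfers until the last word of a returned buffer equals LAST_PATTERN.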

Also, maybe try using more than one PCIe lane (this hasn't been tested by Matt).

The progress on this is the following:

  1. In the simple_dma branch, the hardware as written now successfully receives a chunk of input data of any size from the PC and sends the received data back. This is in contrast to the original example, which only received 2 input arguments. This has been tested. There was a small bug: the bramAddress was not being reset after the DMA transfer back to the PC. That has now been fixed (see the sketch after this list). I still need to test how the hardware behaves when the BRAM is very small, i.e. when the dma_handler has to do multiple DMA transfers; in simulation this works fine.

  2. In the master branch I have set up riffa_interface to connect any number of inputs to a core defined as test_core. I am now working on the output side, where I have to implement double buffering.
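For reference, the bramAddress fix from item 1 amounts to something like this (signal names are illustrative, not the exact ones in simple_dma):

```verilog
// Sketch of the fix from item 1: the BRAM address counter is cleared once the
// DMA transfer back to the PC completes, so the next chunk starts at address 0.
// Signal names are illustrative, not the actual ones used in simple_dma.
module bram_addr_counter #(
    parameter ADDR_WIDTH = 10
)(
    input  wire                  clk,
    input  wire                  rst,
    input  wire                  dma_tx_done, // pulses when the transfer to the PC finishes
    input  wire                  advance,     // advance while writing/reading the buffer
    output reg  [ADDR_WIDTH-1:0] bramAddress
);
    always @(posedge clk) begin
        if (rst || dma_tx_done)
            bramAddress <= {ADDR_WIDTH{1'b0}};  // this reset was the missing piece
        else if (advance)
            bramAddress <= bramAddress + 1'b1;
    end
endmodule
```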

A basic wrapper has been written; see commit 4e03c2f for more details. The wrapper is not optimised, in the sense that it does not yet use double buffers on the output, which is crucial for stable throughput of the user's core.
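For when the output double buffering gets implemented, the select logic would look roughly like this ping-pong scheme (handshake names are assumed, not actual riffa_interface signals):

```verilog
// Sketch of ping-pong (double) output buffering: the user core fills one BRAM
// while the DMA engine drains the other, and the buffers swap roles once both
// the fill and the drain of the current pair have finished.
module pingpong_select (
    input  wire clk,
    input  wire rst,
    input  wire core_chunk_done,  // core has filled its current buffer
    input  wire dma_chunk_done,   // DMA has drained its current buffer
    output reg  core_buf_sel,     // which buffer the core writes (0 or 1)
    output wire dma_buf_sel       // which buffer the DMA reads (the other one)
);
    reg core_done_r, dma_done_r;

    assign dma_buf_sel = ~core_buf_sel;

    always @(posedge clk) begin
        if (rst) begin
            core_buf_sel <= 1'b0;
            core_done_r  <= 1'b0;
            dma_done_r   <= 1'b1;   // nothing to drain yet
        end else begin
            if (core_chunk_done) core_done_r <= 1'b1;
            if (dma_chunk_done)  dma_done_r  <= 1'b1;
            if (core_done_r && dma_done_r) begin
                core_buf_sel <= ~core_buf_sel;  // swap buffers
                core_done_r  <= 1'b0;
                dma_done_r   <= 1'b0;
            end
        end
    end
endmodule
```

The point of the scheme is that the user core never has to stall while the DMA engine drains the previous chunk.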

The wrapper is now working. As stated in the commit message, there is a minor issue, the same one faced with the simple DMA transfer: small data transfers result in occasional DMA failures. Matt has said he will look into it, as he has only tested DMA transfers of 1024 bytes.

Look at commit 0707f36 for more information. The master branch contains the most up-to-date code.

Okay, I think I might know why the occasional DMA transfer of 0x0 bytes occurs. There is a configurable FIFO in the DMA block, and I think the DMA block waits until the FIFO is full before it does a DMA transfer back to the PC. I tested my hardware with a test_core that simply counts up, and the good news is that no data is lost even though I get occasional 0x0-length DMA transfers:

  1. Run 1 (running ./riffaexample): I get 0xc bytes of data, i.e. 12 bytes (3 outputs), which are 0, 1, 2.

  2. Run 2: I get a 0x0-byte transfer, but the software detects no error, which means the DMA block intentionally sends 0x0 bytes.

  3. Run 3: the DMA transfers double the amount of data, i.e. 24 bytes, with the order preserved: 3, 4, 5, 6, 7, 8.

The fact that the PC can still send data to the FPGA shows that the state machine is not stuck and always starts from the idle state, and the fact that every test completes instantly without error shows that the block goes through the state machine normally. Initially I thought that, due to the encoding of the states, my state machine could be stuck in a particular state (I'm still not sure whether this could be a big problem).
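For reference, the counting test_core used above is just something like this (port names assumed):

```verilog
// Sketch of the counting test_core used for the experiment above: it emits an
// incrementing 32-bit value on every accepted output word, so lost or
// reordered data is easy to spot on the PC side. Port names are assumed.
module test_core_counter (
    input  wire        clk,
    input  wire        rst,
    input  wire        out_ready,   // wrapper can accept a word this cycle
    output wire [31:0] out_data,
    output wire        out_valid
);
    reg [31:0] count;

    assign out_data  = count;
    assign out_valid = 1'b1;        // always has the next value ready

    always @(posedge clk) begin
        if (rst)
            count <= 32'd0;
        else if (out_ready)
            count <= count + 32'd1;
    end
endmodule
```

Because the counter is only cleared by reset, the values keep increasing across runs, which is why run 3 continues from 3 rather than restarting at 0.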

Going back to the FIFO: it is configurable, and the depth is currently set to 48 bytes with burst reads and writes set to 16. I think that if I reduce both of these values, I might force the DMA block to transfer a non-zero number of bytes in every DMA transfer to the PC. I will try this out today and see what happens. As mentioned earlier, the DMA block always transfers data when bigger chunks are used.
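To be clear about what I think is going on, the suspected flush condition is roughly this (names and the exact condition are my assumptions, not the real DMA block code):

```verilog
// Sketch of the suspected behaviour: the DMA block holds output in its FIFO
// and only starts a transfer to the PC once a full write burst has
// accumulated, which is why small result sets can come back as 0x0-byte
// transfers. Names and the condition are assumptions, not the real DMA block.
module dma_flush_condition #(
    parameter FIFO_DEPTH_BYTES = 48,  // current setting
    parameter WR_BURST_BYTES   = 16   // current setting; reducing both is the experiment
)(
    input  wire [7:0] fifo_level_bytes,  // how many output bytes are buffered
    output wire       start_transfer     // DMA back to the PC may begin
);
    assign start_transfer = (fifo_level_bytes >= WR_BURST_BYTES);
endmodule
```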

Okay, I have tried out different FIFO depths and burst lengths. It turns out that the FIFO depth and the read/write bursts do play a role in the reliability of DMA transfers. I found settings at which the DMA transfer never sends 0x0 bytes. For example, with a FIFO depth of 8 and read and write bursts set to 1, DMA transfers become reliable at a data size of 72 bytes, i.e. 18 words: the DMA block never transfers 0x0 bytes when it is sending at least 18 words back to the PC. However, with this setting it is less efficient to send back bigger chunks of data; for example, sending back 1024 bytes takes less time with a FIFO depth of 32 and read/write bursts set to 16.

Really stuck at the moment. Not too sure what is going on.

The wrapper is now compatible with other IP cores. I will update the documentation on how a user can connect his/her IP core to the riffa_interface block. Will start optimisations later if required.
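Until the documentation is written, the intended usage is roughly the following; all port names here are placeholders and will be pinned down in the docs:

```verilog
// Sketch of how a user core is expected to hook into riffa_interface: the
// interface block hands the core its input stream and collects its output
// stream. Port names are placeholders, not the final documented interface.
module top_example (
    input wire clk,
    input wire rst
    // PCIe/RIFFA pins omitted in this sketch
);
    wire [31:0] in_data,  out_data;
    wire        in_valid, out_valid;

    riffa_interface interface_inst (
        .clk        (clk),
        .rst        (rst),
        .core_in    (in_data),    // data delivered to the user core
        .core_in_v  (in_valid),
        .core_out   (out_data),   // results collected from the user core
        .core_out_v (out_valid)
        // RIFFA-side ports omitted in this sketch
    );

    user_core core_inst (         // drop-in replacement for test_core
        .clk    (clk),
        .rst    (rst),
        .din    (in_data),
        .din_v  (in_valid),
        .dout   (out_data),
        .dout_v (out_valid)
    );
endmodule
```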