Use UTL to load/store NN configurations
seldridge opened this issue · 2 comments
The current approach of loading an NN configuration through the L1D$ of the Rocket (i.e., through the RoCC interface's `mem` port) is slow and is likely overwriting a large portion of the data in the L1D$. This type of load/store is much better viewed as an uncached load directly from L2, which can be accomplished using one of the AUTL/UTL ports.
Since the UTL interface talks directly to TileLink (and therefore L2), it needs to speak physical addresses. As I understand it, this can be accomplished in one of two ways:
- Rely on the kernel to handle the address translation with `virt_to_phys` of `linux-X.X.XX/include/asm-generic/io.h`
- Do address translation directly on the accelerator using the TLB ports
The former approach seems to be the way to go for the following reasons. I think that address translation through Rocket's TLB ports happens within the context of the current Rocket process. X-FILES/DANA transactions, however, are not necessarily synchronous with respect to the context of Rocket, e.g., a long-running learning transaction from a previous process may need to write back learned weights to memory. Relying on the TLB port to handle address translation therefore seems like the wrong way to go. Furthermore, it is likely that the size of a neural network configuration will extend beyond a page. This would add overhead, as the accelerator would have to keep track of where it is in a page and then do additional page table walks to get the next page. Avoiding this (with contiguous pages) necessarily involves getting the kernel to explicitly manage the ASID--NNID table, and we might as well just have it set up physical addresses.
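To put a rough number on the page-crossing concern, here is a minimal sketch in C that counts how many pages a configuration touches, which is also the number of separate translations the accelerator would need if it went through the TLB port. The 4 KiB page size and the `pages_spanned` helper are assumptions for illustration, not anything taken from the X-FILES/DANA code.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. Adjust if the target uses another page size. */
#define PAGE_SIZE_BYTES 4096u

/*
 * Hypothetical helper: number of distinct pages touched by the byte range
 * [vaddr, vaddr + len). Each page beyond the first means one more
 * translation if the accelerator translates addresses itself.
 */
static size_t pages_spanned(uint64_t vaddr, size_t len)
{
    if (len == 0)
        return 0;
    uint64_t first = vaddr / PAGE_SIZE_BYTES;
    uint64_t last = (vaddr + len - 1) / PAGE_SIZE_BYTES;
    return (size_t)(last - first + 1);
}
```

For example, under these assumptions a 10000-byte configuration starting at a page-aligned address touches three pages, and even a 2-byte access touches two pages if it straddles a page boundary.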
From what @handong32 has stated, we can use the following functions to set up and implement the former approach:
- `alloc_pages` or `kmalloc` to get physically contiguous memory for neural network configurations and the ASID--NNID table

The only oddity here is that with the entire ASID--NNID table living in kernel memory, it is implicitly pinned (I believe). We're driving up the memory usage of the kernel, potentially unnecessarily. Ideally, the ASID--NNID table would live in kernel memory, but the NN configurations would reside in user memory.
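As a rough illustration of the memory cost, `alloc_pages` hands out blocks of 2^order contiguous pages, so a configuration that is slightly larger than a power-of-two number of pages leaves the rest of the block unused. The sketch below is a user-space model of that rounding, not kernel code; the 4 KiB page size and the `order_for_size` and `slack_bytes` helpers are assumptions for illustration.

```c
#include <stddef.h>

/* Assumption: 4 KiB pages. */
#define PAGE_SIZE_BYTES 4096u

/*
 * Hypothetical helper: smallest order such that
 * (PAGE_SIZE_BYTES << order) >= size, i.e. the order one would pass to
 * alloc_pages to cover `size` bytes with one contiguous block.
 */
static unsigned int order_for_size(size_t size)
{
    unsigned int order = 0;
    size_t span = PAGE_SIZE_BYTES;
    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}

/* Bytes allocated but not used by a configuration of `size` bytes. */
static size_t slack_bytes(size_t size)
{
    return ((size_t)PAGE_SIZE_BYTES << order_for_size(size)) - size;
}
```

Under these assumptions, a 20000-byte configuration needs an order-3 block (32768 bytes), leaving 12768 bytes of pinned kernel memory unused.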
@handong32 -- This is likely what's blocking you. Physical reads to the L1 D$ should barf. I did some initial work towards this over a month back (0c62436), but got stuck as I didn't have any physical memory to test it with. (I tried to add a physical memory ANT into the Proxy Kernel but stopped as I was just replicating your work).
This should be pretty straightforward to get working as Berkeley has an example for using the L2 UTL interface here: https://github.com/ucb-bar/rocket/blob/master/src/main/scala/rocc.scala#L185. Testing is a problem, however, as the C++ model takes ages to boot Linux.