feature: support for unaligned access with special address increment pin
jeras opened this issue · 7 comments
Problem
I am writing a RISC-V CPU, and I could take advantage of a RAM with dedicated unaligned access features for mixed 16/32-bit instruction fetches when the C extension (compressed 16-bit instructions) is implemented.
Most RISC-V implementations use a simple fetch buffer to handle unaligned accesses, but I would like to achieve IPC=1, so it is not acceptable to require 2 clock cycles for an unaligned access.
A trivial implementation can be used to describe what I am looking for.
For an aligned access, the 32-bit opcode op would be a full 32-bit read from the memory mem at address addr.
op[31:0] = mem[addr][31:0];
For an unaligned access, two consecutive memory locations must be read.
op[31:0] = {mem[addr+1][15:0], mem[addr][31:16]};
In RTL this can be achieved by splitting the RAM into two 16-bit parts.
The lower part's address would come from a mux between addr and addr+1, depending on whether the access is unaligned.
And the data from the two memories would be swapped (data mux) on an unaligned access.
While the multiplexers are reasonably fast, the adder (actually just an incrementer built from half adders) would add significant delay due to carry propagation.
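To make this concrete, here is a rough SystemVerilog sketch of the split-RAM arrangement described above (module and signal names such as fetch_ram_ext, mem_lo and mem_hi are placeholders made up for illustration, not taken from my actual design; write ports are omitted):

```systemverilog
// Sketch only: RAM split into two 16-bit halves with an external
// incrementer on the lower half and a data swap on the outputs.
module fetch_ram_ext #(
  parameter int AW = 10                    // word address width
)(
  input  logic          clk,
  input  logic [AW-1:0] addr,              // word address of the opcode
  input  logic          misaligned,        // opcode starts at a half-word offset
  output logic [31:0]   op                 // fetched 32-bit opcode
);
  // two 16-bit halves of the original 32-bit memory (write ports omitted)
  logic [15:0] mem_lo [2**AW];
  logic [15:0] mem_hi [2**AW];

  logic [15:0] rdt_lo, rdt_hi;
  logic        misaligned_q;

  always_ff @(posedge clk) begin
    // lower half: incremented address on a misaligned access
    // (this addr + 1 is the carry chain I would like to avoid)
    rdt_lo <= mem_lo[misaligned ? addr + 1'b1 : addr];
    // upper half: always the requested word address
    rdt_hi <= mem_hi[addr];
    misaligned_q <= misaligned;
  end

  // data swap: on a misaligned access mem_hi[addr] holds the low half
  // of the opcode and mem_lo[addr+1] holds the high half
  assign op = misaligned_q ? {rdt_lo, rdt_hi} : {rdt_hi, rdt_lo};
endmodule
```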
Solution
My proposal is to integrate the incrementer into the memory decoder itself, but instead of implementing it at the address input, it could be implemented after the address is decoded by switching between two consecutive decoded signals.
In an RTL representation, the increment signal inc would denote an unaligned access, and the decoded (one-hot) array is dec.
for (i=0; i<size; i++) dec[i] = inc ? (i == addr+1) : (i == addr);
This approach would avoid the carry chain and would therefore have much better timing.
There are further considerations on how to handle the last and first address, and how to handle memory spaces split into multiple memories. This kind of memory could also be used for instruction caches, and there would be further corner cases to consider.
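A behavioural sketch of what I mean by switching between decoded signals follows (again, the module and signal names are made up for illustration). The "increment" is just a 2:1 mux per word line on the one-hot decode, with no carry chain:

```systemverilog
// Sketch only: address decoder with post-decode increment (no adder).
module dec_inc #(
  parameter int AW   = 10,
  parameter int SIZE = 2**AW
)(
  input  logic [AW-1:0]   addr,  // binary word address
  input  logic            inc,   // select the next word line instead
  output logic [SIZE-1:0] dec    // one-hot word-line select
);
  logic [SIZE-1:0] raw;

  always_comb begin
    // plain one-hot decode of the binary address
    raw = '0;
    raw[addr] = 1'b1;

    // "increment" = shift the one-hot vector by one position:
    // each word line i is a 2:1 mux between raw[i] and raw[i-1]
    for (int i = 0; i < SIZE; i++)
      dec[i] = inc ? ((i == 0) ? 1'b0 : raw[i-1]) : raw[i];
    // wrap-around at the last address is left open, as noted above
  end
endmodule
```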
The asymmetry, where only the lower 16-bit part of the memory needs the incrementer, bothered me until I realized this is only the case for little-endian access. If both little- and big-endian accesses were supported, then every byte in a word would be a separate memory, each with its own incrementer.
Ugly FPGA hack
On an FPGA with enough memory to waste, two memories could be used, at least for the lower data half: one for aligned and one for unaligned accesses.
Questions
These questions are for someone with more memory design knowledge than me.
- Would such an approach really produce the expected timing advantage?
- How much area overhead would such a feature add to memories?
- If current memory arrays decode the full 32-bit data width, would such an approach require separate decoders for each half (instruction memory) or byte (data memory) of the array? What would the area overhead be in this case?
- Data swap multiplexers could be added to the memory, but they would not make much sense if the memory space is split into smaller memory blocks.
And a question for CPU RTL designers.
- How would one write a better instruction/data cache with such a memory?
Hi Jeras,
This isn't really feasible for an SRAM as it would take multiple cycles to access multiple addresses. The SRAM needs to precharge for every access and this process is done per clock cycle. This defeats the purpose of what you want...
The address decode is a small part of the memory delay, so the performance of the decoders isn't a huge deal. Most of the delay, especially for larger memories, is in the read access itself: sensing the bitlines.
You could possibly do something with a two port memory. For example, if you fetch both addresses addr and addr+1 from different ports and then add logic to combine them based on the alignment, it would give the same effect.
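Something along these lines (a rough sketch; names are placeholders and the array stands in for a true dual-port SRAM macro, write ports omitted):

```systemverilog
// Sketch only: unaligned fetch from a dual-port RAM by reading
// addr and addr+1 on the two ports and combining the halves.
module fetch_ram_2p #(
  parameter int AW = 10
)(
  input  logic          clk,
  input  logic [AW-1:0] addr,
  input  logic          misaligned,
  output logic [31:0]   op
);
  logic [31:0] mem [2**AW];

  logic [31:0] rdt_a, rdt_b;
  logic        misaligned_q;

  always_ff @(posedge clk) begin
    rdt_a        <= mem[addr];          // port A: requested word
    rdt_b        <= mem[addr + 1'b1];   // port B: next word
    misaligned_q <= misaligned;
  end

  // aligned:    op = mem[addr]
  // misaligned: op = {mem[addr+1][15:0], mem[addr][31:16]}
  assign op = misaligned_q ? {rdt_b[15:0], rdt_a[31:16]} : rdt_a;
endmodule
```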
Hi Matt,
I wrote the previous post in a hurry; this time I have prepared a few schematic diagrams.
I am sure what I am describing can be done, but I am not sure whether it is worth doing, i.e. whether it provides any advantage.
RAM with aligned access only
For clarity, let's start with a memory setup that only allows 32-bit aligned accesses.
For an aligned access, the 32-bit opcode op would be a full 32-bit read from the memory mem at address addr.
op[31:0] = mem[addr][31:0];
RAM with external incrementer
For an unaligned access, two consecutive memory locations must be read.
op[31:0] = {mem[addr+1][15:0], mem[addr][31:16]};
In RTL this can be achieved by splitting the RAM into two 16-bit parts.
The lower part's address would come from a mux between addr and addr+1, depending on whether the access is unaligned.
And the data from the two memories would be swapped (data mux) on an unaligned access.
The provided image shows the address and read data paths, first for an aligned and then for a misaligned access.
RAM with internal incrementer
My proposal is to integrate the incrementer into the memory decoder itself, but instead of implementing it at the address input, it could be implemented after the address is decoded by switching between two consecutive decoded signals.
In an RTL representation, the increment signal inc would denote an unaligned access, and the decoded (one-hot) array is dec.
for (i=0; i<size; i++) dec[i] = inc ? (i == addr+1) : (i == addr);
The following image shows the incrementer placed between the decoder and the bit cell array.
Each line in the bit array has its own multiplexer.
First the address path without incrementation is shown; below it is the path with incrementation.
This approach would avoid the carry chain and would therefore have much better timing.
There are further considerations on how to handle the last and first address, and how to handle memory spaces split into multiple memories. This kind of memory could also be used for instruction caches, and there would be further corner cases to consider.
Hi Jeras,
What you are describing now can mostly be achieved by using two separate 16-bit memories. These would need to be separate because you are accessing two different rows simultaneously, which can't be done unless you have a second port, as I previously described.
The incrementer is quite specialized and could be integrated, but I don't think it would be worthwhile. This is not the common case. The entire decoder would need to remain there, so it wouldn't reduce logic. I'd suggest just supplying the incremented address to the second memory.
Thanks for taking the time.
Yes, the main idea is to use two 16-bit memories instead of a single 32-bit one.
The idea behind pushing the incrementer into the RAM block was not to save logic, but to improve timing.
The delay of an incrementer working on a binary encoded address is defined by carry propagation and would be larger than the delay of an incrementer operating on the decoded address.
Sure, the question remains whether this feature would be attractive to RTL designers.
I am writing a CPU which executes every instruction in a single clock period, and so to be able to support the C extension, I need a setup which supports misaligned accesses. In my specific case it would be a clear advantage.
It is less clear how this would affect performance for a more generic CPU.
As a quick calculation, I would assume 50% of instruction fetches are unaligned, but this only affects performance on taken branches and jumps. I tried to find out how common branches are, but I do not have the right keywords to google it. I will use a number I heard once: about one instruction in six is a branch, and only about half of those are taken. This would mean roughly 10% of instruction fetches are from a non-consecutive address. Combined with the 50% ratio of unaligned addresses, the performance impact would be less than 5%.
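Spelling the numbers out: 1/6 × 1/2 ≈ 8% of instructions would be taken branches or jumps, and with about half of their targets unaligned that gives roughly 8% × 50% ≈ 4% of fetches, which is where the "less than 5%" figure comes from.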
Probably not worth designing a custom RAM.
Maybe somebody who designed the instruction fetch unit and caches for a high performance CPU would have additional comments, but this is a subject for a different forum.
I think the assumption that 50% of accesses are unaligned would need to be supported. C compilers generally have alignment rules specifically for this reason. Unaligned accesses are normally an exception and not the common case. In fact, some ISAs will literally trigger an exception.
The addresses you are fetching from sound like you are describing a cache... These have spatial and temporal locality. While these would use SRAMs, they can do other things to improve common-case performance.
By the way, adding the incrementer would also slow down all aligned accesses.