[uArch Improvement] Address Deduplication
Background
We currently do a lot of content-addressed lookups into the load and store queues:
- Every load does an SQ lookup for store data forwarding
- Every store does an LQ lookup to find memory order conflicts
These lookups require a physical address comparison between every queue entry and every incoming load/store. It would be great to reduce the number or width of these comparisons (especially once we switch to RV64 with even larger physical addresses).
Proposal
Add an extra stage before the load/store queues to deduplicate incoming physical addresses. This can be implemented using a deduplication table indexed by address hash. If a collision occurs (a different in-flight address already occupies the indexed entry), the system would stall until the old address is no longer in flight.
In this way, physical addresses are reduced to perfect hashes (no collisions) within the memory subsystem. Thus, instead of comparing full physical addresses, we could get away with comparing address hashes of only a few bits.
The dedup table could also track some additional state, like the youngest load for a given address. This might make it possible to get rid of the LQ lookup on stores entirely. Similarly, some store forwarding cases (e.g. only one in-flight store for a given address) could possibly be handled without a CAM lookup in the SQ.
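As a rough behavioral sketch of the allocate/stall/release flow described above (not RTL, and not a settled design): the class name `DedupTable`, the methods `index_of`, `allocate` and `release`, the per-entry reference count and the XOR-fold hash are all assumptions made up for illustration. Addresses are treated as opaque integers, so whatever granularity (byte, word, line) the real table would use is left open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DedupEntry:
    addr: int = 0  # full physical address that currently owns this slot
    refs: int = 0  # number of in-flight loads/stores using this slot


class DedupTable:
    """Behavioral model only; names and hash function are hypothetical."""

    def __init__(self, index_bits: int = 4):
        self.index_bits = index_bits
        self.entries = [DedupEntry() for _ in range(1 << index_bits)]

    def index_of(self, addr: int) -> int:
        # Assumed hash: XOR-fold the address down to index_bits bits.
        mask = (1 << self.index_bits) - 1
        h = 0
        while addr:
            h ^= addr & mask
            addr >>= self.index_bits
        return h

    def allocate(self, addr: int) -> Optional[int]:
        """Return the short ID for addr, or None if the op must stall."""
        idx = self.index_of(addr)
        entry = self.entries[idx]
        if entry.refs == 0:
            # Slot is free: claim it for this address.
            entry.addr, entry.refs = addr, 1
            return idx
        if entry.addr == addr:
            # Same address already in flight: share the slot.
            entry.refs += 1
            return idx
        # A different in-flight address owns the slot: collision, stall.
        return None

    def release(self, idx: int) -> None:
        """Called when a load/store using this slot is no longer in flight."""
        entry = self.entries[idx]
        assert entry.refs > 0, "release without matching allocate"
        entry.refs -= 1
```

Because a slot is only ever owned by one address at a time, two in-flight operations have equal IDs exactly when their full addresses are equal, which is what would let the LQ/SQ compare `index_bits`-wide IDs instead of full physical addresses.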
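One way the youngest-load idea might look, as a hedged behavioral sketch: `OrderTracker`, `on_load` and `store_conflicts` are hypothetical names, and sequence numbers are modeled as plain increasing integers (a real implementation would have to deal with wraparound, and with clearing the field on commit or squash, both of which are ignored here).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrackedEntry:
    # Sequence number of the youngest load seen for this dedup slot, if any.
    youngest_load: Optional[int] = None


class OrderTracker:
    """Hypothetical per-slot ordering state kept next to the dedup table."""

    def __init__(self, num_entries: int = 16):
        self.entries = [TrackedEntry() for _ in range(num_entries)]

    def on_load(self, idx: int, seq: int) -> None:
        # Keep only the youngest (largest sequence number) load per slot.
        entry = self.entries[idx]
        if entry.youngest_load is None or seq > entry.youngest_load:
            entry.youngest_load = seq

    def store_conflicts(self, idx: int, seq: int) -> bool:
        # A store conflicts if some load younger than it has already
        # been seen for the same slot (i.e. the same address).
        youngest = self.entries[idx].youngest_load
        return youngest is not None and youngest > seq
```

Note that this only answers whether a conflict exists. Finding which load to replay from would still need extra information, so this is at best a filter in front of the LQ lookup under these assumptions.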
Challenges
Dedup Table Recovery
Should be doable by keeping a log of all changes and then reverting on mispredict. Reverting may take a while though if a lot of loads/stores were invalidated.
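A minimal sketch of log-based recovery, assuming the only state to restore is a per-slot count of in-flight operations and that operations may reach the table out of program order. `RecoverableRefcounts`, `allocate` and `squash` are hypothetical names. The return value of `squash` is the number of log entries undone, which would correspond to recovery cycles if the hardware undoes one entry per cycle (also an assumption).

```python
class RecoverableRefcounts:
    """Behavioral model of a dedup table's in-flight counts plus a change log."""

    def __init__(self, num_entries: int = 16):
        self.refs = [0] * num_entries
        self.log = []  # one (seq, idx) record per in-flight allocation

    def allocate(self, seq: int, idx: int) -> None:
        # Record every change so it can be undone on a mispredict.
        self.refs[idx] += 1
        self.log.append((seq, idx))

    def squash(self, first_squashed_seq: int) -> int:
        """Undo all allocations made by ops at or after first_squashed_seq."""
        kept, undone = [], 0
        for seq, idx in self.log:
            if seq >= first_squashed_seq:
                self.refs[idx] -= 1  # revert this op's effect
                undone += 1
            else:
                kept.append((seq, idx))
        self.log = kept
        return undone
```

For example, with allocations logged for sequence numbers 1, 5, 3 and 7 (in that arrival order), `squash(4)` undoes the entries for 5 and 7 and leaves those for 1 and 3 in place. The cost grows with the number of invalidated loads/stores, which is the concern raised above.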
Delayed Store Queue lookup
Unless we can come up with some tricks, delaying the SQ lookup by one cycle pretty much implies delaying loads by one cycle. This could possibly be alleviated by tracking loads for which no SQ lookup is required and handing them out one cycle early.
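A sketch of how "no SQ lookup required" might be tracked, assuming the dedup table keeps a per-slot count of in-flight stores. `StoreCounter`, `LoadPath` and the method names are hypothetical, and the sketch deliberately ignores how this decision would be made early enough to actually save the cycle, as well as stores whose addresses are not yet known.

```python
from enum import Enum


class LoadPath(Enum):
    EARLY = "early"    # no in-flight store to this address: skip the SQ lookup
    LOOKUP = "lookup"  # at least one in-flight store: needs the delayed SQ lookup


class StoreCounter:
    """Hypothetical per-slot count of in-flight stores."""

    def __init__(self, num_entries: int = 16):
        self.stores = [0] * num_entries

    def on_store_alloc(self, idx: int) -> None:
        self.stores[idx] += 1

    def on_store_done(self, idx: int) -> None:
        assert self.stores[idx] > 0, "store completion without allocation"
        self.stores[idx] -= 1

    def classify_load(self, idx: int) -> LoadPath:
        # Loads to slots with no in-flight store cannot forward from the SQ.
        return LoadPath.EARLY if self.stores[idx] == 0 else LoadPath.LOOKUP
```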
Dedup Table Structure
The dedup table would need at least N read ports and N write ports for N incoming AGU_UOps, which is not ideal if we don't have at least a 1-read, N-write memory primitive.
The banked CAMs in the LQ and SQ, for reference, have 1 write port and N (content-addressed) read ports. Using a memory with too many ports here would use significant resources, so much so that it may be better to just stick with the CAMs.