[uArch Improvement] Address Deduplication

Question

Opened this issue 4 months ago · 0 comments

Background

We currently do a lot of content-addressed lookups into the load and store queues:

Every load does an SQ lookup for store data forwarding
Every store does an LQ lookup to find memory order conflicts

These lookups require a physical address comparison between every queue entry and every incoming load/store. It would be great to reduce the number or width of these comparisons (especially once we switch to RV64, with its even larger physical addresses).
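To put a rough number on this, here is a back-of-the-envelope sketch of how comparison cost scales with tag width. The queue size, lookup count and tag widths are hypothetical values picked for illustration, not figures from the core, and `cam_compare_bits` is a made-up helper.

```python
def cam_compare_bits(entries: int, lookups_per_cycle: int, tag_bits: int) -> int:
    """Bits compared per cycle when every lookup is checked against every entry."""
    return entries * lookups_per_cycle * tag_bits


# Hypothetical sizes: a 16-entry queue, 2 lookups per cycle, and two
# illustrative tag widths (a narrower and a wider physical address).
narrow = cam_compare_bits(entries=16, lookups_per_cycle=2, tag_bits=30)
wide = cam_compare_bits(entries=16, lookups_per_cycle=2, tag_bits=54)
print(narrow, wide)
```

The cost is linear in each of the three factors, so shrinking the compared tag is as effective as shrinking the queue by the same ratio.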

Proposal

Add an extra stage before the load/store queues to deduplicate incoming physical addresses. This can be implemented using a deduplication table indexed by address hash. If a collision occurs (the slot is already held by a different in-flight address), the system would stall until the old address is no longer in flight.

In this way, physical addresses can be reduced to perfect hashes (no collisions) within the memory subsystem. Thus, instead of comparing full physical addresses, we could get away with comparing address hashes that are only a few bits wide.
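A minimal behavioral sketch of such a table, in Python rather than RTL. Everything here is an assumption made for illustration: the names `DedupTable`, `allocate` and `release`, the XOR-fold hash, the per-slot reference count, and the table size. Addresses are assumed to be already truncated to whatever granularity the queues compare at.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DedupEntry:
    addr: int = 0      # full physical address currently owning this slot
    refcount: int = 0  # number of in-flight loads/stores using this address


class DedupTable:
    """Behavioral model only; size and hash function are placeholders."""

    def __init__(self, index_bits: int = 4) -> None:
        self.index_bits = index_bits
        self.entries: List[DedupEntry] = [
            DedupEntry() for _ in range(1 << index_bits)
        ]

    def hash(self, addr: int) -> int:
        # Hypothetical hash: XOR-fold the address down to index_bits.
        mask = (1 << self.index_bits) - 1
        h = 0
        while addr:
            h ^= addr & mask
            addr >>= self.index_bits
        return h

    def allocate(self, addr: int) -> Optional[int]:
        """Return the short tag for addr, or None if the op has to stall."""
        idx = self.hash(addr)
        entry = self.entries[idx]
        if entry.refcount == 0:
            # Free slot: this address now owns the tag.
            entry.addr, entry.refcount = addr, 1
            return idx
        if entry.addr == addr:
            # Same address already in flight: share the tag.
            entry.refcount += 1
            return idx
        # Slot is held by a different in-flight address: collision.
        return None

    def release(self, tag: int) -> None:
        """Called once per op when a load/store using this tag leaves the machine."""
        entry = self.entries[tag]
        assert entry.refcount > 0
        entry.refcount -= 1
```

While a tag is held, equal tags imply equal addresses, so the queues only need to compare `index_bits` bits. The one full-width comparison per op happens in `allocate`, against a single indexed entry rather than against every queue entry.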

The dedup table could also track additional state, like the youngest load for a given address. In this way it might be possible to get rid of the store's LQ lookup entirely. Similarly, some store forwarding cases (e.g. only one in-flight store for a given address) could possibly be handled without a CAM lookup in the SQ.
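As a rough illustration of the youngest-load idea, the sketch below keeps one sequence number per dedup tag. The class and method names are hypothetical. It assumes monotonically increasing sequence numbers (no wraparound), ignores access sizes and byte masks, and never clears state when loads retire, so a real design would need more than this.

```python
from typing import Dict


class YoungestLoadTracker:
    """Per-tag record of the youngest load that has already executed."""

    def __init__(self) -> None:
        # dedup tag -> sequence number of the youngest executed load
        self.youngest: Dict[int, int] = {}

    def on_load_execute(self, tag: int, seq: int) -> None:
        prev = self.youngest.get(tag)
        if prev is None or seq > prev:
            self.youngest[tag] = seq

    def store_conflicts(self, tag: int, store_seq: int) -> bool:
        """True if a load younger than this store has already executed
        to the same address, i.e. a possible memory order violation."""
        y = self.youngest.get(tag)
        return y is not None and y > store_seq
```

This check is conservative: it also fires when the younger load actually got its data forwarded from a store between the two, a case the LQ lookup could rule out.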

Challenges

Dedup Table Recovery

This should be doable by keeping a log of all changes and reverting them on a mispredict. Reverting may take a while, though, if many loads/stores were invalidated.
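A sketch of what log-based recovery could look like, again as a Python model. The names are hypothetical, and the table state is reduced to a plain array of reference counts. Log records are not assumed to be in program order (ops may reach the table out of order), so rollback scans the whole log.

```python
from typing import List, Tuple


class LoggedRefcounts:
    """Refcount array with an undo log; a stand-in for the dedup table state."""

    def __init__(self, size: int = 16) -> None:
        self.refcount: List[int] = [0] * size
        # Each record: (sequence number of the op, table index it incremented)
        self.log: List[Tuple[int, int]] = []

    def increment(self, seq: int, idx: int) -> None:
        self.refcount[idx] += 1
        self.log.append((seq, idx))

    def commit(self, seq: int) -> None:
        """Drop records of ops up to and including seq; they can no longer be squashed."""
        self.log = [r for r in self.log if r[0] > seq]

    def rollback(self, flush_seq: int) -> int:
        """Undo the increments of all ops younger than flush_seq.

        Returns the number of records undone, a rough proxy for recovery
        time if hardware reverts one record per cycle (an assumption).
        """
        keep: List[Tuple[int, int]] = []
        undone = 0
        for seq, idx in self.log:
            if seq > flush_seq:
                self.refcount[idx] -= 1
                undone += 1
            else:
                keep.append((seq, idx))
        self.log = keep
        return undone
```

Undoing by decrement rather than by restoring a saved value keeps the rollback correct even if older ops released their references in between.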

Delayed Store Queue lookup

Unless we can come up with some trick, delaying the SQ lookup by one cycle pretty much implies delaying loads by one cycle. This could possibly be alleviated by tracking loads for which no SQ lookup is required and handing them out one cycle early.

Dedup Table Structure

The dedup table would need at least N read ports and N write ports for N incoming AGU_UOp, which is not ideal unless we have at least a 1-read, N-write memory primitive.

For reference, the banked CAMs in the LQ and SQ are 1 write, N (content-addressed) reads. A memory with too many ports would use significant resources, so much so that it may be better to just stick with the CAMs.