riscv/riscv-CMOs

Use cases for CMOs - collect

AndyGlew opened this issue · 5 comments

We need to collect use cases for CMOs. We want to document them.

This issue is to collect them - brief summary up top. Add more cases in comments. Link to more detail - wiki, email, wherever.

  • Security related cache flushes for things like Spectre and Meltdown
  • software managed multiprocessor consistency
  • non-coherent I/O DMA
  • persistence
  • power management, e.g. save to DRAM
  • save to NVRAM
  • reliability - e.g. clearing ECC errors
    ---- ZBB? --------
  • real time
  • performance optimization
  • prefetches
  • residency, LRU, eviction under program control
  • reducing bus traffic (DCBA, Inval)

I've talked to our firmware guys about their cache management use cases. Here's what I've gathered:

The typical cache operations that we use are related to data cache when accessing shared memory with
HW (or possibly another CPU), assuming these are non-coherent, and non-write-thru memory.

The FW performs writes to the memory, and then flushes the cache to push the dirty writes before
telling the HW to use the memory. We also use a barrier/fence after FW updates the data and
before telling the HW that the data is ready. The size and alignment of the accesses vary from as
small as 4 bytes to as large as ~100-200 bytes.

  1. FW updates memory
  2. FW Flush
  3. FW Barrier/Fence
  4. FW Tells HW (e.g. register write)
  5. HW reads memory

The same thing occurs when HW updates the memory and FW reads it. In that case FW does
an invalidate instead of a flush. Another consideration on the FW read side is placing a fence to ensure the CPU
has not prefetched and/or speculated on data that may have been updated by the HW. (Both directions are sketched below.)
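
For illustration, a minimal sketch of both directions in C, assuming hypothetical cbo_flush_range()/cbo_inval_range() primitives and a made-up doorbell MMIO address (none of these names or addresses come from the thread):

    #include <stdint.h>

    /* Hypothetical range CMO primitives; names are illustrative only. */
    void cbo_flush_range(void *addr, uint32_t bytes);  /* write back dirty lines   */
    void cbo_inval_range(void *addr, uint32_t bytes);  /* drop possibly-stale lines */

    /* Assumed memory-mapped doorbell register (address is invented). */
    static volatile uint32_t *const HW_DOORBELL = (volatile uint32_t *)0x40000000;

    /* FW produces, HW consumes (steps 1-4 above). */
    void fw_to_hw(void *buf, uint32_t bytes)
    {
        /* 1. FW has already updated buf[]. */
        cbo_flush_range(buf, bytes);                   /* 2. push dirty data out    */
        __asm__ volatile ("fence w, o" ::: "memory");  /* 3. flush before doorbell  */
        *HW_DOORBELL = 1;                              /* 4. tell HW data is ready  */
    }

    /* HW produces, FW consumes (the reverse direction). */
    void fw_from_hw(void *buf, uint32_t bytes)
    {
        while (*HW_DOORBELL != 2)                      /* wait for HW's signal      */
            ;
        __asm__ volatile ("fence i, r" ::: "memory");  /* signal before data reads  */
        cbo_inval_range(buf, bytes);                   /* drop stale cached copies  */
        /* FW may now read the HW's data from buf[]. */
    }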

To add a little more, FW use cases are:

  • flush the entire cache
  • flush a specific address range
  • invalidate the entire cache
  • invalidate a specific address range

The importance and frequency of these use cases depend in part on whether there is coherency logic
between the processor caches and other HW entities.

[Our current processor] supports 2 modes:

  • ALL Cache Lines
  • Individual Cache Line by Physical Address

Today we invalidate by address and range with a FW routine (See API below). What we do within that function
today is loop through all lines that are touched from AddrStart through AddrStart + ByteCount and invalidate
each line. So that means we round down on the start, and round up on the end and invalidate every line in
between.

void InvalidateDataCacheLines( uint8 *AddrStart, uint32 ByteCount )
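
For illustration, a sketch of what that routine might look like internally, assuming a 64-byte line and a hypothetical single-line invalidate primitive (the real FW internals were not shown):

    #include <stdint.h>

    typedef uint8_t  uint8;
    typedef uint32_t uint32;

    #define CACHE_LINE_SIZE 64u   /* assumed; the thread does not state it */

    /* Hypothetical primitive: invalidate the one line containing addr. */
    extern void InvalidateDataCacheLine(uintptr_t addr);

    void InvalidateDataCacheLines(uint8 *AddrStart, uint32 ByteCount)
    {
        /* Treat Length = 0 as one byte so at least one line is touched,
           matching the "Length = 0 always results in 1 line" behavior
           described below. */
        uint32 bytes = ByteCount ? ByteCount : 1u;

        /* Round down on the start, round up on the end, and invalidate
           every line in between. */
        uintptr_t first = (uintptr_t)AddrStart & ~(uintptr_t)(CACHE_LINE_SIZE - 1u);
        uintptr_t last  = ((uintptr_t)AddrStart + bytes - 1u)
                          & ~(uintptr_t)(CACHE_LINE_SIZE - 1u);

        for (uintptr_t line = first; line <= last; line += CACHE_LINE_SIZE)
            InvalidateDataCacheLine(line);
    }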

It would be best for our use case to be able to specify the Start and Length of the range and have the HW
take care of alignment, but we don't get that much flexibility today.

Just to be explicit to your questions:

  • Can the start address point anywhere (ie - not aligned with a cache line)?

    • Yes, anywhere. We don't know/care the cache line alignment in the main application FW.
  • Can the end address point anywhere?

    • Yes, anywhere
  • What happens if start_address > end_address? Do you wrap? Do you not do a flush?

    • Today the FW routine doesn't even check. I would say, "don't do that" / exception / flush entire cache?
  • What happens if start_address == end_address? Do you do nothing? Do you do a single address?

    • Depending on how we want the API to look, that would be 1 line, or illegal. I know in the PMP regions the TOR is the last valid address (inclusive), so is that how we want this address to be encoded? If so, then Start == End would be a single address
    • If we encode the end address as the +1 range boundary (exclusive), then I think Start == End is illegal / don't do that (nop)
    • From our current API we take in Length, so this could be Length = 0, which would always result in 1 line
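
For what it's worth, the encodings discussed above normalize to the same line loop; here is a small sketch, with a 64-byte line assumed and inval_line() as a hypothetical per-line primitive:

    #include <stdint.h>

    #define LINE 64u   /* assumed line size */

    extern void inval_line(uintptr_t line_addr);   /* hypothetical primitive */

    static uintptr_t line_of(uintptr_t a) { return a & ~(uintptr_t)(LINE - 1u); }

    /* TOR-style inclusive end: Start == End means exactly one line. */
    void inval_inclusive(uintptr_t start, uintptr_t end_incl)
    {
        for (uintptr_t l = line_of(start); l <= line_of(end_incl); l += LINE)
            inval_line(l);
    }

    /* Exclusive (+1) end: Start == End is an empty range, i.e. a nop. */
    void inval_exclusive(uintptr_t start, uintptr_t end_excl)
    {
        if (start >= end_excl)
            return;            /* "don't do that" / nop */
        for (uintptr_t l = line_of(start); l < end_excl; l += LINE)
            inval_line(l);
    }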

I've used CMOs to assist coherence in hybrid systems, where some processors are coherent but accelerators are not coherent.

In this scenario, correctness is an imperative, but these applications are using accelerators which means performance is also vital. (Without performance, the value-add of the accelerator is significantly weakened.)

The accelerator may be closely coupled to the processor, even sharing an L2 or a LLC. Alternatively, it may be located more remotely across an NoC.

Cases:

A) range-based WRITEBACK of dirty data from processor cache to a level of cache that may be shared with the accelerator; if none, then writeback to main memory.

B) range-based EVICT of any clean or dirty data from processor caches, all the way down to (but excluding) the level of cache shared with the accelerator.

C) range-based WRITEBACK + EVICT (combination of A and B)

Since this is done by the application, ranges are based on virtual addresses. The virtual address ranges may span large blocks of data (eg, large matrices), even though only a small fraction of it may be held in the cache (eg, 4GB matrix and 32kB primary cache).

Case A ensures the accelerator will read the latest data. The processor may have modified or initialized the data or a portion of the data. It can be difficult to track which portions of the data have been modified, so large ranges are the norm (to be safe and avoid tracking overhead).

Case B ensures that writes done by the accelerator will be visible to the processor. That is, it removes stale data from the processor so the processor will get the latest copy. Some processor implementations may be able to snoop on external writes, but this assumes the accelerator connection is located nearby (tightly coupled?) and observable by the processor (not true in most NoCs).

In Case C, a programmer may sometimes wish to combine the writeback with an evict of clean data; this saves a separate Case B pass later when the address ranges overlap.
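
A sketch of how Cases A-C might look around an accelerator offload, using hypothetical virtual-address range CMOs (all names here are invented for illustration):

    #include <stddef.h>

    /* Hypothetical range CMOs operating on virtual addresses. */
    void cmo_writeback(void *va, size_t len);        /* Case A: push dirty data        */
    void cmo_evict(void *va, size_t len);            /* Case B: drop clean+dirty lines */
    void cmo_writeback_evict(void *va, size_t len);  /* Case C: A and B combined       */

    /* Hypothetical non-coherent accelerator interface. */
    void accel_run(const float *in, float *inout, size_t n);
    void accel_wait(void);

    void offload(const float *in, float *inout, size_t n)
    {
        /* Case A: push the processor's (possibly partial) updates so the
           accelerator reads the latest data. The range may be far larger
           than the cache; only resident lines actually need work. */
        cmo_writeback((void *)in, n * sizeof *in);

        /* Case C: the accelerator both reads and rewrites this buffer, so
           write back dirty lines AND evict clean ones now, saving a
           separate Case B pass after the offload completes. */
        cmo_writeback_evict(inout, n * sizeof *inout);

        accel_run(in, inout, n);
        accel_wait();

        /* Reads of inout[] now miss in the cache and fetch the
           accelerator's results from shared memory. */
    }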

Workarounds involve changing caching policies. I see two cases here:

  1. the ability to mark virtual address ranges (pages?) as write-through instead of write-back

  2. the ability to mark virtual address ranges (pages?) as uncached

Case 1 eliminates the need to perform Case A, but retains the need to perform Case B.

Case 2 eliminates the need to do all Cases (A, B, C and 1), but can be a lower-performance option.
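
As a sketch of these forward-looking workarounds, assuming a hypothetical OS interface in the spirit of mprotect() (no such API is defined for RISC-V today; names are invented):

    #include <stddef.h>

    /* Hypothetical cacheability policies and syscall; purely illustrative. */
    enum mem_policy { POLICY_WRITEBACK, POLICY_WRITETHROUGH, POLICY_UNCACHED };
    int set_mem_policy(void *va, size_t len, enum mem_policy p);

    void prepare_shared_buffer(void *buf, size_t len)
    {
        /* Workaround 1: write-through removes the need for Case A (no dirty
           lines to push), but Case B is still needed on the read side. */
        set_mem_policy(buf, len, POLICY_WRITETHROUGH);

        /* Workaround 2: uncached removes Cases A, B and C entirely, at the
           cost of every access going to memory. */
        /* set_mem_policy(buf, len, POLICY_UNCACHED); */
    }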

Note that Cases A, B and C are historic (affecting data already in the cache), whereas Cases 1 and 2 are forward-looking (affecting what will be placed in the cache in the future). As a result, the CMO TG may choose to only tackle Cases A, B and C, but pass along Cases 1 and 2 to a TG charged with looking after VM and PMA. Very likely, the CMO TG already has to tackle A, B and C no matter what (to fix up what has been placed in the cache already).

MIPS, for example, has the ability to cover Case 1 and 2 by updating caching policies for each page in its page tables.

Aside: Cases 1 and 2 do operate similarly to range-based CMOs. However, they will have to iterate over page tables rather than cache lines. Is there anything we can learn about range-based CMOs when we think of these cases? (eg, if we want similar interfaces to both features, it is better to plan ahead)

Using a trap to perform any of these Cases (A, B, C, 1 or 2) will impact the effectiveness of the accelerator. For Cases 1 and 2, the operations can be done ahead of time (which implies that programmers should not change such properties at fine granularity). In contrast, Cases A, B and C will always be placed inline with performance-oriented code.

Another use case: changing the cacheability/memory type, e.g. via the PTEs or PMAs.

It looks pretty likely that the virtual memory group is going to provide a few bits to specify memory type in the page table entries. Also, in some systems the PMAs (physical memory attributes) may change dynamically. (The presence of PMAs is assumed by RISC-V, but as far as I know there is no definition of PMA registers or formats, although some aspects may be associated with PMPs, depending on how issues related to speculation when paging is disabled are resolved.)

Obviously, if a page that was cacheable (WB/WT) is changed to uncacheable (UC), cache flushes may be needed.

BTW, this is a good example of a cross tech group issue that should probably be put in JIRA. I will finish writing it here and then consider moving it if I can figure out where to move it to.

BTW, this use case immediately raises a question that should probably move to a new issue.


TBD: complete this issue... my PC needs to reboot :-(

Basically, if we transition

WB --> UC directly, we must be ready for the possibility of UC accesses to memory that is still in the cache. Many systems dislike that. (Mostly if UC is a speculatable type; perhaps not if it is a non-speculatable type or mode.)

If we transition indirectly, break before make WB --> invalid --> UC, then we hit a problem: our current POR is for CMOs to use virtual addresses, but if the PTE is invalid... do we need physical addresses?

Maybe WB --> UC-non-spec --> UC-spec? But then the Virt Mem TG needs to have UCS and UCNS mem types.

===

There are fewer issues with UC --> WB/WT, but there will be some related to memory ordering and transactions in flight.

Note that ARM pretty much solves this issue and provides recipes for both the TLB and cache management. The key is that CMOs ignore the cacheability attributes and always check caches.

Another issue you bring up is

"our current POR is for CMOs to use virtual addresses"

I think this should be stated that CMOs use effective addresses, VM is orthogonal in my mind. Defining CMOs to use effective addresses side-steps all the mode stuff, including VM, virtualization, etc.

Edit here: Additionally, we need to define these attribute transitions in the context of the PMAs. There are no defined "memory types" as in other architectures.... :(

Agreed: CMOs must ignore the cacheability attributes.

That's not what I was talking about.

PTE.WB --> PTE.UC

Break before make: PTE.WB --> PTE.INV --> PTE.UC

When do we flush?

Sequence 1, break before make:

    write PTE(EA).WB --> PTE(EA).INV
    shootdown
    CBO.FLUSH        <--- the EA no longer has a valid translation, so a VA/EA-based CMO cannot be used
    write PTE(EA).INV --> PTE(EA).UC

Sequence 2, direct:

    write PTE(EA).WB --> PTE(EA).UC
    shootdown
    CBO.FLUSH
    ==> objection: UC and WB exist for the same address during the window

Could be the same as a DMA I/O

Sequence 3, clean first:

    CBO.CLEAN        <--- write back dirty data while the page is still WB
    write PTE(EA).WB --> PTE(EA).UC
    shootdown
    CBO.FLUSH        <--- get rid of stale (now clean) lines
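
Putting that last sequence into pseudo-C for one page (pte_set_uncached(), tlb_shootdown() and the per-page cbo_* wrappers are all hypothetical stand-ins; the real CBO instructions operate per cache block, so each wrapper would loop over the page's lines):

    /* Hypothetical stand-ins for PTE updates, shootdown, and per-page CMOs. */
    void pte_set_uncached(void *ea);  /* WB --> UC, translation stays valid */
    void tlb_shootdown(void *ea);     /* remove stale WB translations       */
    void cbo_clean_page(void *ea);    /* cbo.clean over the page's lines    */
    void cbo_flush_page(void *ea);    /* cbo.flush over the page's lines    */

    /* Change one page from WB to UC without break-before-make, so the
       VA-based CMOs always have a valid translation to work with. */
    void wb_to_uc(void *ea)
    {
        cbo_clean_page(ea);    /* 1. push dirty data while the page is still WB */
        pte_set_uncached(ea);  /* 2. WB --> UC                                  */
        tlb_shootdown(ea);     /* 3. no stale WB translations remain            */
        cbo_flush_page(ea);    /* 4. drop any stale clean lines left behind     */
    }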