PV MMU Design
Background
For shadow MMU, in order to intercept modifications to the guest page table, all guest page tables are write-protected. This means that each modification will trigger a #PF VM-exit and an instruction emulation, resulting in poor performance when the guest frequently modifies the page table, such as when the guest application frequently allocates and frees memory.
With the synchronized and unsynchronized pages optimization, the L1 guest page table is allowed to be writable after a write page fault and is write-protected again when the guest performs TLB flushing. This reduces the #PFs and emulations when the guest modifies multiple gptes in the L1 page table. However, this also has some drawbacks, including the following:
- It is necessary to mark all upper level SPs as unsynchronized when allowing the L1 SP to be unsynchronized. This may increase the lock hold time when there are multiple processes in the Linux guest, all of which share the kernel mapping PGDs. A modification to the kernel mapping will mark all root SPs as unsynchronized, and each root SP will need to be synchronized again when it is loaded, as root SP synchronization only marks the current root SP as synchronized.
- INVLPG emulation only synchronizes the modified gptes, but the SP is still marked as unsynchronized. If the guest uses INVLPG to flush TLB entries one by one, then only one vCPU needs to perform spte synchronization, while the other vCPUs usually do nothing after acquiring the MMU lock. This increases the MMU lock contention.
Purpose
Firstly, not all page table modifications need to be reported to the hypervisor. Only changes that require a later TLB flush matter, which means only permission demotions need to be reported.
Secondly, inspired by the synchronized and unsynchronized pages optimization, page table modifications do not need to be immediately notified to the hypervisor. They can be delayed until TLB flushing later. This allows the guest to cache the modifications and commit them together during TLB flushing, reducing the number of notifications (hypercalls).
Finally, without write protection, the guest needs to notify the hypervisor to free the SP when the guest frees the page table. Otherwise, KVM will reach the SP limit quickly and reclaim the SP frequently, leading to bad performance.
Design
- Pagetable operations
  All pagetable modifications in the guest need to use the specific PTE operation functions.
  - set_pgd/set_p4d/set_pud/set_pmd/set_pte
    These functions should be used when the guest modifies a PGD/P4D/PUD/PMD/PTE entry. If the modification requires a TLB flush later, the gptep can be cached and committed later.
  - release_pgd/release_p4d/release_pud/release_pmd/release_pte
    These functions should be used when the guest frees the page table memory.
- Lazy mode
  Following the synchronized and unsynchronized pages design, the guest can cache the modified gpteps in its ring buffer during pagetable modification. On TLB flushing, all cached gpteps are committed to the hypervisor, so the hypervisor can synchronize the associated spteps directly.
  - Global ring buffer vs Per-CPU ring buffer
    Although the TLB is CPU scoped, the page table is shared between all CPUs. Therefore, the ring buffer should ideally be global: when one CPU performs TLB flushing, all cached gpteps are committed and the spteps are synchronized, so that CPU sees both the PTE changes made by other CPUs and the updated shadow page table.
    A global ring buffer complicates things and requires protection semantics. Therefore, we are considering a Per-CPU ring buffer, where each vCPU caches the modified gpteps in its own ring buffer. The vCPU should commit the cached gpteps before sending IPIs to other vCPUs to shoot down TLBs.
    However, this only works when PTE modification and TLB flushing are atomic. There are some delayed TLB flushing mechanisms (e.g., mmu_gather and batched unmap TLB flush) in the Linux kernel memory management, which require some extra changes to work correctly. When one CPU delays TLB flushing and marks TLB flushing as pending, another CPU in the munmap/mprotect path will check the TLB flushing pending status if it sees that the PTE is not present. If it sees that TLB flushing is pending, it will attempt to do TLB flushing for this mm. However, the CPU that modified the page table may not have that mm loaded, so it does not receive the TLB flushing IPI, meaning its cached gpteps are not committed and the shadow page table is not updated. As a result, the CPU doing the flush can still see outdated translations. Therefore, a CPU that modifies the page table of an mm other than the currently loaded one should either commit the gpteps immediately or add itself to the TLB flushing IPI target CPU range, so that the CPU that needs to do TLB flushing can send the IPI to it.
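The per-CPU caching rule described above can be sketched as follows. This is a minimal user-space model, not code from the proposal: `struct pv_mmu_cpu_buffer`, `pv_mmu_cache_gptep()`, `pv_mmu_commit()`, `pv_mmu_flush_tlb()` and the `mm_is_loaded` parameter are all hypothetical names, and the hypercall is replaced by a stub that only counts what would have been committed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PV_MMU_BUF_LEN 512 /* PAGE_SIZE / sizeof(u64), assuming 4 KiB pages */

/* Hypothetical per-CPU cache of modified gpteps. */
struct pv_mmu_cpu_buffer {
	uint64_t pteps[PV_MMU_BUF_LEN];
	size_t count;
};

/* Stub bookkeeping standing in for the real hypercall. */
static size_t committed_entries;
static size_t commit_calls;

/* Hand every cached gptep to the hypervisor (stubbed) and empty the buffer. */
static void pv_mmu_commit(struct pv_mmu_cpu_buffer *buf)
{
	if (buf->count == 0)
		return;
	committed_entries += buf->count;
	commit_calls++;
	buf->count = 0;
}

/*
 * Cache one modified gptep. If the page table being modified does not
 * belong to the mm currently loaded on this CPU, this CPU may never get
 * the shootdown IPI, so the entry is committed immediately.
 */
static void pv_mmu_cache_gptep(struct pv_mmu_cpu_buffer *buf,
			       uint64_t entry, bool mm_is_loaded)
{
	if (buf->count == PV_MMU_BUF_LEN)
		pv_mmu_commit(buf); /* buffer full: flush early */
	buf->pteps[buf->count++] = entry;
	if (!mm_is_loaded)
		pv_mmu_commit(buf);
}

/* Commit first, then (in a real guest) send the TLB shootdown IPIs. */
static void pv_mmu_flush_tlb(struct pv_mmu_cpu_buffer *buf)
{
	pv_mmu_commit(buf);
	/* IPIs to the other vCPUs would be sent here. */
}
```

The sketch takes the "commit immediately" option for a foreign mm. The alternative mentioned above, adding the CPU to the IPI target range, would keep the batching but needs changes in the flush path instead.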
- Detect and Setup
  - Detect:
    - CPUID
      KVM_FEATURE_PV_MMU
  - Setup:
    - MSR_KVM_PV_MMU
      A new virtual MSR is used to record the GPA of the per-CPU ring buffer. One bit is used to indicate that the PV MMU mode is enabled.
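One possible layout for the MSR_KVM_PV_MMU value is sketched here. The proposal only says that the MSR holds the buffer GPA and that one bit signals enablement. The choice of bit 0 as the enable bit, the requirement that the buffer be page aligned, and the names `PV_MMU_MSR_ENABLED`, `PV_MMU_MSR_GPA_MASK` and the `pv_mmu_msr_*` helpers are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bit 0 = enabled, bits 12 and up = page-aligned buffer GPA. */
#define PV_MMU_MSR_ENABLED  (1ULL << 0)
#define PV_MMU_MSR_GPA_MASK (~0xfffULL)

/* Build the MSR value; fails if the buffer GPA is not page aligned. */
static bool pv_mmu_msr_encode(uint64_t buf_gpa, uint64_t *val)
{
	if (buf_gpa & ~PV_MMU_MSR_GPA_MASK)
		return false;
	*val = buf_gpa | PV_MMU_MSR_ENABLED;
	return true;
}

/* Hypervisor-side decoding of the same value. */
static bool pv_mmu_msr_enabled(uint64_t val)
{
	return (val & PV_MMU_MSR_ENABLED) != 0;
}

static uint64_t pv_mmu_msr_gpa(uint64_t val)
{
	return val & PV_MMU_MSR_GPA_MASK;
}
```

With a page-sized, page-aligned buffer the low 12 bits of the GPA are always zero, which is what leaves room for an enable flag in the same register.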
- Ring buffer
  For simplicity, the buffer in the first version is implemented as a simple buffer instead of a real ring buffer such as the perf ring buffer.
#define PV_MMU_PTEPS_BUFFER_LEN (PAGE_SIZE / sizeof(u64))
struct pv_mmu_buffer {
	u64 pteps[PV_MMU_PTEPS_BUFFER_LEN];
};
/*
 * Only permission demotions need to be notified, and only 3 bits are available in the gptep.
 * P -> NP
 *   It also includes changes to the page frame number (PFN), dropping the accessed bit,
 *   setting the reserved bit, and transitioning from User (U) to Supervisor (S).
 *
 * RW -> RO
 *   It also includes dropping the dirty bit.
 *
 * X -> NX
 */
#define PV_MMU_SET_PTE_NP _BITUL(0)
#define PV_MMU_SET_PTE_RO _BITUL(1)
#define PV_MMU_SET_PTE_NX _BITUL(2)
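The classification in the comment above can be sketched as a helper that compares the old and new PTE values and returns the matching flags. This is an illustration under stated assumptions: `pv_mmu_demotion_flags()` and `pv_mmu_pack()` are hypothetical names, the reserved-bit case is left out because the reserved mask depends on the CPU's physical address width, and packing the flags into the low 3 bits of the 8-byte-aligned gptep address is one reading of "only 3 bits are available in the gptep". The flag macros are redefined with plain shifts so the block compiles outside the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural x86-64 PTE bits. */
#define PTE_P        (1ULL << 0)
#define PTE_RW       (1ULL << 1)
#define PTE_US       (1ULL << 2)
#define PTE_A        (1ULL << 5)
#define PTE_D        (1ULL << 6)
#define PTE_NX       (1ULL << 63)
#define PTE_PFN_MASK 0x000ffffffffff000ULL

/* Same values as the proposal's _BITUL() definitions. */
#define PV_MMU_SET_PTE_NP   (1UL << 0)
#define PV_MMU_SET_PTE_RO   (1UL << 1)
#define PV_MMU_SET_PTE_NX   (1UL << 2)
#define PV_MMU_SET_PTE_MASK 7UL

/* Hypothetical helper: which demotions does the change old -> new contain? */
static unsigned long pv_mmu_demotion_flags(uint64_t old, uint64_t new)
{
	unsigned long flags = 0;

	if (!(old & PTE_P))
		return 0; /* a non-present entry cannot be cached in the TLB */
	if (!(new & PTE_P))
		return PV_MMU_SET_PTE_NP;

	/* P -> NP class: PFN change, accessed bit dropped, U -> S. */
	if (((old ^ new) & PTE_PFN_MASK) ||
	    ((old & PTE_A) && !(new & PTE_A)) ||
	    ((old & PTE_US) && !(new & PTE_US)))
		flags |= PV_MMU_SET_PTE_NP;

	/* RW -> RO class: write permission or dirty bit dropped. */
	if (((old & PTE_RW) && !(new & PTE_RW)) ||
	    ((old & PTE_D) && !(new & PTE_D)))
		flags |= PV_MMU_SET_PTE_RO;

	/* X -> NX. */
	if (!(old & PTE_NX) && (new & PTE_NX))
		flags |= PV_MMU_SET_PTE_NX;

	return flags;
}

/* Assumed encoding: flags live in the low 3 bits of the aligned gptep address. */
static uint64_t pv_mmu_pack(uint64_t gptep, unsigned long flags)
{
	assert((gptep & PV_MMU_SET_PTE_MASK) == 0);
	return gptep | (flags & PV_MMU_SET_PTE_MASK);
}
```

A return value of 0 means the change is a pure promotion (or touches a non-present entry) and does not have to be cached at all.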
- Hypercall
  - KVM_HC_PV_MMU_SET_PTE
    Notify the hypervisor about the GPTE update.
    - a0: the start index within the buffer
    - a1: the count of cached gpteps within the buffer
    - a2: the TLB flushing related flags
      #define PV_MMU_FLUSH_TLB_CURRENT _BITUL(0)
      #define PV_MMU_FLUSH_TLB _BITUL(1)
  - KVM_HC_PV_MMU_RELEASE_PT
    Notify the hypervisor that a GPT is about to be released.
    - a0: the gpa of GPT
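To make the KVM_HC_PV_MMU_SET_PTE argument convention concrete, here is a rough sketch of how a handler might validate a0/a1/a2 and walk the committed range. It is not KVM code: `pv_mmu_handle_set_pte()`, the callback type and `pv_mmu_count_entry()` are hypothetical, the error return of -1 is arbitrary, and it assumes that each buffer entry carries the demotion flags in its low 3 bits. Because the first version uses a simple buffer, the range is not allowed to wrap.

```c
#include <stdint.h>

#define PV_MMU_PTEPS_BUFFER_LEN  512UL /* PAGE_SIZE / sizeof(u64), 4 KiB pages */
#define PV_MMU_SET_PTE_MASK      7UL
#define PV_MMU_FLUSH_TLB_CURRENT (1UL << 0)
#define PV_MMU_FLUSH_TLB         (1UL << 1)

struct pv_mmu_buffer {
	uint64_t pteps[PV_MMU_PTEPS_BUFFER_LEN];
};

/* Hypothetical per-entry callback: gptep address, demotion flags, context. */
typedef void (*pv_mmu_entry_fn)(uint64_t gptep, unsigned long flags,
				void *opaque);

/* Example callback that only counts the entries it is given. */
static void pv_mmu_count_entry(uint64_t gptep, unsigned long flags,
			       void *opaque)
{
	(void)gptep;
	(void)flags;
	(*(unsigned long *)opaque)++;
}

/*
 * a0: start index, a1: entry count, a2: TLB flush flags.
 * Returns the number of entries processed, or -1 on invalid arguments.
 */
static long pv_mmu_handle_set_pte(const struct pv_mmu_buffer *buf,
				  unsigned long a0, unsigned long a1,
				  unsigned long a2,
				  pv_mmu_entry_fn fn, void *opaque)
{
	unsigned long i;

	/* The range must stay inside the buffer (no wrap-around). */
	if (a0 >= PV_MMU_PTEPS_BUFFER_LEN ||
	    a1 > PV_MMU_PTEPS_BUFFER_LEN - a0)
		return -1;
	/* Reject unknown flush flags. */
	if (a2 & ~(PV_MMU_FLUSH_TLB_CURRENT | PV_MMU_FLUSH_TLB))
		return -1;

	for (i = a0; i < a0 + a1; i++) {
		uint64_t e = buf->pteps[i];

		fn(e & ~(uint64_t)PV_MMU_SET_PTE_MASK,
		   (unsigned long)(e & PV_MMU_SET_PTE_MASK), opaque);
	}
	return (long)a1;
}
```

Since the buffer is guest-writable memory, a real handler would also have to treat every entry as untrusted input, which this sketch does not attempt.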
- Guest
  - PVOPs
    - set_pte
    - release_pte
    - lazy_mode
    - start_context_switch/end_context_switch
  - PTE modification
    All PTE modifications should use the set_pte PVOPs after enabling PV MMU mode. However, some places in the Linux guest do not follow this, so we need to change them.
    - ptep_get_and_clear
    - ptep_set_wrprotect
    - ptep_test_and_clear_young
  - TLB shootdown
    The cached gpteps in the buffer should be committed before sending the TLB shootdown IPIs.
    - inc_mm_tlb_gen
    - flush_tlb_all/flush_tlb_kernel_range/arch_tlbbatch_flush
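The reason helpers such as ptep_set_wrprotect need changing can be illustrated with a small model: a helper that clears a bit in the entry directly never reaches the set_pte hook, so the demotion is never cached. The sketch below routes the write-protect through a hook instead. `struct pv_mmu_ops`, `pv_set_pte()`, `sketch_ptep_set_wrprotect()` and `hook_calls` are hypothetical names and do not reflect the kernel's actual paravirt structures.

```c
#include <stdint.h>

#define PTE_RW (1ULL << 1) /* architectural x86-64 write-permission bit */

/* Hypothetical ops table with a single hook. */
struct pv_mmu_ops {
	void (*set_pte)(uint64_t *ptep, uint64_t val);
};

static unsigned long hook_calls;

/*
 * Stand-in for the PV set_pte hook. A real hook would compare the old
 * and new values and cache the gptep when the change is a demotion.
 */
static void pv_set_pte(uint64_t *ptep, uint64_t val)
{
	hook_calls++;
	*ptep = val;
}

static struct pv_mmu_ops mmu_ops = { .set_pte = pv_set_pte };

/*
 * Write-protect through the hook rather than by flipping the bit in
 * place. This read-modify-write is not atomic; a real implementation
 * must cope with concurrent hardware accessed/dirty updates.
 */
static void sketch_ptep_set_wrprotect(uint64_t *ptep)
{
	mmu_ops.set_pte(ptep, *ptep & ~PTE_RW);
}
```

The same pattern would apply to the other listed helpers: compute the new value, then pass it through the hook so the hypervisor eventually learns about the change.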
- Hypervisor
  - PV MMU mode
    The PV MMU mode is an enlightened shadow MMU mode. After the guest enables it, write protection for the guest page table and synchronized/unsynchronized SP are dropped.
  - SPTE synchronization delay
    During the KVM_HC_PV_MMU_SET_PTE hypercall, all committed gpteps are cached in the VM global buffer, and spte synchronization is delayed until the guest needs the TLB flushing, which is kvm_mmu_sync_roots().
- Debug
  - Problem
    How to intercept all PTE modifications if someone forgets to use the previous operations when there is no write protection for the guest page table.
- Is there any POC-level code that shows the details of your design, especially for PTEPS_BUFFER?
- Is there any performance data to help us buy into the design?
- How does this PV MMU mode coexist with the legacy shadow MMU mode, e.g. when handling host/guest shared memory or DMA buffers?
- How do you ensure guest spte consistency in the pvm-mmu context when multiple vCPUs trigger the TLB synchronization delay mechanism on the same address space in parallel?