tandasat/HyperPlatform

Is vmxoff path really safe/correct?

ionescu007 opened this issue · 26 comments

Hi,

I experimented with turning VMX off by using DPC routine targeted to each processor, with synchronization barrier. If DPCs were delivered as part of SYSTEM process (such as from idle thread, or from system thread), everything worked fine. However, if a process was currently running, and the DPC interrupted the process, many times the process crashes with invalid memory access, or sometimes a kernel driver itself crashes (such as win32kfull running on the context of dwm.exe), when accessing session memory.

These problems go away if doing VMX off the way you do it, by sticking with a single thread that is already running under SYSTEM process. However, even if you have a comment that suggests you saw some strangeness.

So therefore, I have two questions:

  1. Are you sure that all the register state is correctly saved and restored? You only save integer registers, but not xmm0,1, 2, 3, 4, 5 on x64, which are volatile registers (and neither the other xmm registers). Why do you only save integer registers during vmm_exit and vmxoff restore? Shouldn't FPU registers be saved too?

Must vmx_launch and vmx_off be done in the same address space? In other words, will vmx_off restore the CR3 that was saved during vmx_launch? Because if so, I can understand the danger of capturing CR3 in system process (session -1), but then doing vmx_off in dwm.exe (session 0). The totally different CR3 will certainly cause a crash. Does this mean that it's only safe to do on/off from within SYSTEM process?

Thank you.

Thank you for the through description of questions.

As for the second question, VMXOFF does not change a value of CR3. However, it does not mean that VMXON and VMXOFF can only be done on the SYSTEM process. As long as a value of CR3 after VMXOFF remains the same as that of before VM-exit, it will be fine.

In case of HyperPlatform, it simply sets the same CR3 value for both the SYSTEM process (VmcsField::kGuestCr3) and the VMM (VmcsField::kHostCr3), and does not restore the CR3 that was being used at the time of VMCALL after VMXOFF. For this reason, VMXOFF can only be safely performed from the SYSTEM process in current implementation. The following are changes of CR3 w.r.t. VMXOFF:

  1. Before VMCALL (CR3 == VmcsField::kGuestCr3)
  2. VMCALL causing VM-exit
    CR3 is overwritten by a value stored in VmcsField::kHostCr3 during transition between VMX non-root operation, a guest context, to VMX root operation mode, a VMM context (CR3 == VmcsField::kHostCr3)
  3. VMXOFF (CR3 == VmcsField::kHostCr3)
  4. After VMXOFF, a guest context (CR3 == VmcsField::kHostCr3)

So, if VMCALL leading to VMXOFF is executed with a different CR3 from that of VmcsField::kHostCr3, this would cause crashes, as an original CR3 is not going to be restored even after VMXOFF.

Does this answer your question clear enough?

As for the first question, I do not have an answer right now, but it sounds good pointing out. I will look into it and will reply here shortly.

Thank you,

As for the restoring registers, you are right. I believe that the VMM should save and restore non-interger registers too if they can be modified by the VMM since they are not automatically saved and restored by a processor on VM-exit and VM-enter.

In case of HyperPlatform, it does not appear to be an issue because those unhandled registers are not modified by the VMM; or VM-exit does not happen under a situation where consistency of those registers is required. For this reason, I would hold off updates unless an concrete issue is seen.

Thank you,

Hi,

Thank you for the explanation on CR3. That makes a lot of sense. It looks like the requirement for SYSTEM could go away, if, after the vmcall, the CR3 would be captured, and overwritten in the kGuestCr3 field before the vmxoff. But I don't think it's worth doing that.

As for non-integer registers, on x64, the calling convention does support the compiler auto-generating function calls which use xmm registers. This will happen automatically in certain cases, in other cases, one must use __vectorcall as the calling convention to force this. Because of this, it is certainly possible for someone to have a vector function that is using volatile XMM registers, and hit a VMM Exit. Because you are using Visual Studio and because Windows Kernel x64 allows XMM registers, I think it's technically possible for hyperplatform.sys to end up using XMM registers -- for example if a memcpy becomes inlined + vectorized.

So I think this is taking a risk that certain optimization settings/further code might hit. It may be OK now, but not if someone adds more routines without realizing them.

Thanks for your insight!

Thank you for sharing your knowledge on use of XMM registers. I was unaware of most of them.

I agree with your thoughts on potential risk of not fixing it. While the issue may not hit at this time, it would be tough to figure out a bug is caused by XMM registers for someone who are not aware of it. Given the fact that this project is meant to be a platform, fix should be considered. For now, I have created #5 for reference and potential future work.

I appreciate your thorough risk assessment and suggestion.

while working on bareflank.org, I ran into a similar issue with XMM, as I do believe they need to be addressed. The old vmxcpu code never saved these registers because it only ran on 32bit, which did not included XMM registers in the ABI, which is why we never included them in the MoRE hypervisor (as it's based on vmxcpu). On Bareflank, we use Libc++ which touches XMM and FPU registers, so these absolutely need to be saved in our case. Our current solution is to use the fxsave and fxrstor functions. In our state save area. Figured this info might help you guys in the future if you choose to fix this problem. Here is our PR that adds this code if you want an example

Bareflank/hypervisor#123

I highly recommend you use XSAVE/XRESTORE if available, otherwise you risk to miss certain registers and have hard to debug issues :)

  • Assuming LibC++ might use such registers (if you start using AES-NI or AVX).

Thanks for sharing your findings. While I am not entirely familiar with the extended states, looks like an XSAVE feature set is more comprehensive ones, as far as I can see from this slides from Intel.

I originally thought a chance of hitting this issue was small and negligible, but it does not seem convincing anymore since Libc++, which one of the most common libraries, is affected. Libc++ may not going to be used inside HyperPlatform, but I would not surprise if other libraries triggered the same issue.

I will study the extended states first (ie, supported processor generations, difference between XSAVE, -OPT, -S and -C, and any implications) and then will work on implementation for fix.

Thanks for the advice, yeah we will be using XSAVE as well. The PR that is there will be updated with a couple of additional fixes as I found a nasty issue with -O3 as GCC uses SSE a lot with the heavy optimizations enabled. The problem is, GCC assumes the stack is 16 byte aligned, and so we were seeing segfaults as a result. The other thing I was missing was a restore on "promote" when coming out of VMX-non root. I should have my fixes in today for these.

As far as preservation of x87, SSE and AVX are concerned (and that assumption is fine I think--need better understanding), the simplest way seems to use XSAVE and XRESTORE. It will require only checks for:

  • CPUID function 1:ECX.XSAVE[bit 26] == 1 to check support of the XSAVE
  • CPUID function 0DH, sub-function 0 to check if AVX state is supported

If either of them is not supported, using FXSAVE and FXRSTOR is the cleanest solution as they "save and restore the states of the XMM registers and the MXCSR register (SSE state), along with x87 state" (FXSAVE AND FXRSTOR INSTRUCTIONS), or RtlCaptureContext() as they seem to save SSE state too.

Some notes below.

Terms and Basics (XSAVE-SUPPORTED FEATURES AND STATE-COMPONENT BITMAPS)

  • Each processor state managed by the "XSAVE feature set" is called state. For example, x87 state for the x87 FPU execution environment, SSE state for the streaming SIMD extensions.
  • States are referred to as "state components," and managed by "state-component bitmaps."
  • A state-component bitmap comprises 64 bits; each bit in such a bitmap corresponds to a single state
    component. For example, bit 0 corresponds to the x87 state and bits in the range 62:10 are not currently defined.
    • The XSAVE feature set takes state-component bitmaps from in EDX:EAX, and that value is called the "instruction mask,"
    • If the bit corresponding to a state component is clear in XCR0, instructions in the XSAVE feature set will not operate on that state component, regardless of the value of the instruction mask
  • "user state components" can be managed by the entire XSAVE feature set, and XCR0 contains a state-component bitmap for them
  • "supervisor state components" can be managed only by XSAVES and XRSTORS, and IA32_XSS MSR contains a state-component bitmap for them
    • All the state components corresponding to bits in the range 9:0 are user state components, except PT state
  • The "requested-feature bitmap (RFBM)" is a bitmap to be actually specify which components should be saved or restored.
    • for user state components, RFBM == EDX:EAX & XCR0
    • for supervisor state components, RFBM == EDX:EAX & (XCR0 | IA32_XSS)
  • The XSAVE feature set allows saving and loading processor state from a region of memory called an "XSAVE area".

Availability (ENUMERATION OF CPU SUPPORT FOR XSAVE INSTRUCTIONS AND XSAVESUPPORTED FEATURES)

  • If CPUID function 1:ECX.XSAVE[bit 26] == 0, no support of the XSAVE feature set at all
  • If CPUID function 1:ECX.XSAVE[bit 26] == 1,
    • at least XGETBV, XRSTOR, XSAVE, and XSETBV are supported, and
    • CR4.OSXSAVE can be set to 1
  • CPUID function 0DH, sub-function 0 returns
    • supported "components," and
    • a size of an XSAVE area for all the user state components supported by this processor
  • CPUID function 0DH, sub-function 1 returns availability of
    • XSAVEOPT, which is simply an optimized version of XSAVE
    • XSAVEC, which uses "compaction extensions" (ie, uses the "compacted format" for the extended region of the an XSAVE area) that are supposedly efficient than standard one (the "standard format").
    • XSAVES, which saves the "supervisor state components" apart from the "user state components".

Data structures (XSAVE AREA)

  • The "legacy region" comprises the 512 bytes starting at the area’s base address. It is used to manage the state components for x87 state and SSE state
  • The "XSAVE header" comprises the 64 bytes starting at an offset of 512 bytes from the area’s base address
  • The "extended region" starts at an offset of 576 bytes from the area’s base address. It is used to manage the state components other than those for x87 state and SSE state

I am writing the code right now so I'll update the PR to bareflank that has at least what we are working on. With any luck I should have ours fixed by tonight. The approach we are taking is assuming that XSAVE is available as we only support SandyBridge and above. Once you have XSAVE, you can get the size of the save area by doing a

mov rax, 0x0D
cpuid
mov <state_save_size>, rcx

One thing I learned (took a bit to debug) was that xrstor will GPF if you don't zero out this memory. Since we have multi-core support, we do a lot of work on the stack to keep things thread-safe, and so not clearing the state save area was causing bits in the state save area to randomly trigger GPFs on restore. Also don't forget to make sure the state save area is 64byte aligned.

The harder part is how to handle XSETBV and CPUID. There does not seem to be a way to fake the result of XGETBV. At least, a call to XSETBV always traps, so we can ignore writes that disable bits we are using (will likely add MPX support in the future). Not really sure how best to handle XGETBV, but hopefully Windows (as Linux is ok) just enables all of the XCR0 bits and leaves it alone.

Ok, I updated the PR. It's not done yet as I still have to plum in the rest of the entry points, and I need to add the XSAVE logic to the exit handler, but this at least shows one way to use it

https://github.com/rianquinn/hypervisor/blob/compiler_flags/bfvmm/src/misc/src/execute_entry.asm#L88

Hi,

Note that for Windows, there are APIs such as RtlGetProcessorExtendedFeatures and KeSaveExtendedProcessorState/KeRestoreExtendedProcessorState which should take care of most of this.

Thank you for APIs, Alex. I did not realize XSTATE_MASK_GSSE meant the AVX state. Those functions are very helpful. I will write tests and update code shortly.

@rianquinn, thanks for sharing gotchas and code. Thanks to API provided by Windows, I am likely to be able to avoid direct use of the XSAVE feature set instructions. As for XCR0, if I recall correctly, I hardly saw VM-exit via XSETBV on Windows, perhaps, only at the time of system shutdown or something under a non-regular situation. So it may not be a big issue on Windows too.

I was going to use APIs, but KeSaveExtendedProcessorState() could be used only when IRQL <= DISPATCH_LEVE, while VM-exit could be executed at any IRQL. I will use xsave/xrestore directly once availability of those instructions and AVX support is validated with RtlGetEnabledExtendedFeatures() (for some reasons, I did not find declaration of the API and had to write a prototype of it manually).

@rianquinn, you may be able to use intrinsics functions for xsave/xrestore etc depeding on compiler's support.
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_xsave

At least MSVC supports them well enough for my porpose.
https://msdn.microsoft.com/en-us/library/hh977022.aspx

Sadly in my case, the entry point for both the Kernel -> C++ and the entry point for the exit handler need to start with a simple ASM block as I have to control each register access very carefully, so I have a bit of hand written ASM to pull that off. But yeah... One thing that I will likely work on at some point is porting some of the intrinsics code that we do have to the Intel calls, as GCC does support these, and reducing the amount of ASM code we have would be a good thing. The code I linked above not only has to run xsave, but it also has to do MS64 ABI to System V ABI conversions (not in there yet), so some of that ASM work that you see is preperation for that code.

I have not faced an issue and found documents w.r.t zero-ed out memory on XRSTOR either. Instead, I had to debug a bit to noticed that XSAVE/XRSTOR raises an exception when CR0.TS[bit 3] is 1. I only saw this situation on x86 Windows for now. As workaround, code like below worked.

  Cr0 cr0 = { __readcr0() };
  cr0.fields.ts = 0;
  __writecr0(cr0.all);
  _xsave(stack->processor_data->xsave_area, stack->processor_data->xsave_inst_mask);

It is kind of ugly though.

In our case, we only support 64bit, so we don't have to worry about TS as it has no purpose on x86_64 since task switching is not available. That being said, Windows doesn't use x86 tasks, so you should be ok disabling it.

Actually Windows (like Linux and OS X) uses tasks for Double Faults and
NMIs, the latter which are recoverable.

Additionally, video mode switching is implemented, in some cases, with the
help of INT 10 / VIDEO ROM code running in V8086 mode, with a special TSS
that has an IOPM to allow direct I/O access. All of the above may react
badly to turning of the TS bit.

Satoshi: IRQL requirements don't mean much when running in VMM context. For
example, even if you enter at PASSIVE_LEVEL, you are technically at
HIGH_LEVEL for all intents and purposes, as interrupts are disabled. So by
that analogy, you should be calling almost zero Windows APIs, except if
they are documented to run at HIGH_LEVEL.

Best regards,
Alex Ionescu

On Mon, Jul 4, 2016 at 11:16 AM, Rian Quinn notifications@github.com
wrote:

In our case, we only support 64bit, so we don't have to worry about TS as
it has no purpose on x86_64 since task switching is not available. That
being said, Windows doesn't use x86 tasks, so you should be ok disabling
it.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#3 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AFxIeDHfyJGM0aI6IzjxLQJtLdc3WP4wks5qSU33gaJpZM4HtbQh
.

Thank you for the note, Alex.

I do not think I understand why you cannot call API when interrupts are disabled. The eflags.IF is cleared when VM-exit happened but the IF only affects hardware interrupt, and exceptions can still occur. I tested that even page-in was processed fine in the VMM-context if IRQL is PASSIVE_LEVEL. To my knowledge, the IRQL requirement mostly stems from if page-in can be processed--in other words, if the process can enter wait state--, and interrupts are irrelevant.

I bet that I am missing something and appreciate if you could explain a bit more about why disabling interrupts is technically the same as being IRQL==HIGH_LEVEL.

A test program attached. ConsoleApplication1.zip

Hi Satoshi,

Wow -- I cannot believe you were crazy enough to try page-in from VMM
context! Let me explain to you why VMM context == HIGH_LEVEL :-)

  1. Is it safe for you to be context switched by the OS while in the middle
    of VMM mode? Of course not... So you are at least at DISPATCH_LEVEL. Is it
    safe for you to "wait" on an object while at VMM mode? Of course not -- you
    would be context switched to another thread/idle thread which would now be
    running as VMM Host!!!

  2. Is it safe/OK for you to receive DPCs while in the middle of VMM mode?
    Again, of course not. Another reason why you are at least at
    DISPATCH_LEVEL. Could you receive a DPC, even if you wanted to? Nope --
    receiving a DPC requires an interrupt, and IF is off, so Local APIC will
    never deliver it

  3. Will you receive any Device Interrupts? Nope, because EFLAGS IF is off.
    Would you want to be interrupted in the middle of VMM mode? Also nope. So
    you are at least at MAX_DIRQL.

  4. Will you receive the clock interrupt? Nope (also why you hit a CLOCK
    WATCHDOG BSOD sometimes)... So you are at least at CLOCK_LEVEL.

  5. Will you receive IPIs? Nope, because IF is off, so LAPIC will never send
    them. You also probably don't want to be running IPI while inside VMM
    host... So you are at least at IPI_LEVEL.

  6. Technically because you are not in the middle of handling an IPI, but
    rather you've disabled interrupts completely, you are at IPI_LEVEL + 1, aka
    HIGH_LEVEL.

In other words, if you call, for example, ExAllocatePoolWithTag, and this
is PAGED POOL, you can get unlucky and this will require page-in which
requires blocking your thread, and now, some other thread will run in VMM
host mode... Sure, you can get lucky and control will come back to you, but
this is insane... If you request NON PAGED POOL, it will "appear to
work"... And then in one situation, a TLB flush will be required, which
sends an IPI... Which can't be delivered... And so it will hang. Etc.,
etc...

Hope this makes sense.

Best regards,
Alex Ionescu

On Mon, Jul 4, 2016 at 9:21 PM, Satoshi Tanda notifications@github.com
wrote:

Thank you for the note, Alex.

I do not think I understand why you cannot call API when interrupts are
disabled. The eflags.IF is cleared when VM-exit happened but the IF only
affects hardware interrupt, and exceptions can still occur. I tested that
even page-in was processed fine in the VMM-context if IRQL is
PASSIVE_LEVEL. To my knowledge, the IRQL requirement mostly stems from if
page-in can be processed--in other words, if the process can enter wait
state--, and interrupts are irrelevant.

I bet that I am missing something and appreciate if you could explain a
bit more about why disabling interrupts is technically the same as being
IRQL==HIGH_LEVEL.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#3 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AFxIeLDFbUDm4AYlTbJIfPnugLNrwIxTks5qSdu6gaJpZM4HtbQh
.

Wow, thank you for the clear explanation. I understand what you meant now.

I knew that VMM context should not be context switched and that it did not receive the clock interrupt and IPI, but was not able to think of a connection between IPI and interrupt. It is obvious that IPI relies on a hardware interrupt. And if IPI did not work, TLB flush would not work neither.

While I started to write an initial version of VMM, I noticed that calling API from VMM-context made the VMM noticeably unstable or caused funny symptoms like this even if I obeyed requirements of IRQL. So I decided to avoid using them as much as possible, but I was not able to explain well how it could cause trouble. The scenario of ExAllocatePoolWithTag + NonPagedPool is a such good example.

By the way, the test for paging-in was needed to make my question clearer to ask you for explanation. Besides, such experiments are always fun :)

By the way, the test for paging-in was needed to make my question clearer to ask you for explanation. Besides, such experiments are always fun :)

Second that!!!

Oh yeah, I meant "good crazy", not "stupid crazy" ;-)

Best regards,
Alex Ionescu

On Tue, Jul 5, 2016 at 9:57 AM, Rian Quinn notifications@github.com wrote:

By the way, the test for paging-in was needed to make my question clearer
to ask you for explanation. Besides, such experiments are always fun :)

Second that!!!


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#3 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AFxIeLc4xH5PjiBEmg0Ig-D_pQmujB5hks5qSozkgaJpZM4HtbQh
.

By the way @tandasat there is new Windows 10 API: KeGetEffectiveIrql() which will return HIGH_LEVEL at all times while in VMM mode ;-) Proving my point :)

That is beautiful :) I should probably document that fact.