kernkonzept/fiasco

x86/vmx: vmm can't cancel event (irq) injection

Closed this issue · 13 comments

An attempt of a VMM to cancel a event (irq) injection is ignored at https://github.com/kernkonzept/fiasco/blob/master/src/kern/ia32/vm_vmx.cpp#L296. The upper bit of Vmx::F_entry_int_info may stay pending (valid) in the VMCS and an now unexpected virtual IRQ gets delivered to the VM as soon as interrupt delivery is enabled again in the VM. This leads in our case to a not booting VM because of invalid guest state VM exits during early boot. I encountered it during our work on a generalized x86 virtualization interface for Genode.

The commit alex-ab/foc@a841138 avoids the issue for us and let boot the VM successfully.

Thanks Alex for bringing this issue to our attention, I will have a look into it.

So far I understand the VMM injected an invent, which was not received by the guest due to closed interrupts, but the VM-entry interrupt-information field in the HW VMCS was written when the first injection request was issued.
Now a bit later, the guest has still closed interrupts, and the VMM issues another vcpu-resume call to the kernel, after it cleared the valid bit in its VM-entry interrupt-information field.
Your expectation is that the kernel should drop the still-pending event injection by revoking the valid bit.

Is that correct?

So far I understand the VMM injected an invent, which was not received by the guest due to closed interrupts, but the VM-entry interrupt-information field in the HW VMCS was written when the first injection request was issued.

Yes.

Now a bit later, the guest has still closed interrupts, and the VMM issues another vcpu-resume call to the kernel, after it cleared the valid bit in its VM-entry interrupt-information field.
Your expectation is that the kernel should drop the still-pending event injection by revoking the valid bit.
Is that correct?

Yes. On seL4 and NOVA this works and there the VM boots up on the kernel-independent VMM.

How do you keep track of events the VMM has written to the VM-entry interrupt-information were injected? Do you expect the VMM to be able to probe the current VM-entry interrupt-information field for the valid bit to be gone before you inject the next interrupt? Are you working with the 'Interrupt-window exiting' (primary Processor-based VM-Execution Controls[2])?

Are there any other requirements for or expectations of the kernel interface in that regard, that I should be aware of?

How do you keep track of events the VMM has written to the VM-entry interrupt-information were injected? Do you expect the VMM to be able to probe the current VM-entry interrupt-information field for the valid bit to be gone before you inject the next interrupt?

According to the Intel spec

VM entry injects an event if and only if the valid bit (bit 31) is 1. The valid bit in this field is cleared on
every VM exit (see Section 27.2).

the specific bit should be off on a VM exit, but maybe there are exceptions of that rule? If the spec is right, than this patch isn't necessary. According to my instrumentation in the kernel, however this bit is on after resume_vm_vmx.

   load_guest_xcr0(host_xcr0, guest_xcr0);
 
   unsigned long ret = resume_vm_vmx(&vcpu->_regs);
+
+  if (Vmx::vmread<Mword>(Vmx::F_entry_int_info) & (1UL << 31))
+    printf("entry bit 31 still on %d\n", ret);
+
   // vmread error?
   if (EXPECT_FALSE(ret))
     return -L4_err::EInval;

Are you working with the 'Interrupt-window exiting' (primary Processor-based VM-Execution Controls[2])?

No, the according bit is off. (Vmx::F_proc_based_ctls & 0x4)

Are there any other requirements for or expectations of the kernel interface in that regard, that I should be aware of?

Nothing special what I could name today. I think the general expectation was/is, that if the VMM reads/writes this VMCS field than this should happen - of course solely if it's not security critical for the host.

That's odd. Thanks for bringing that up, I will investigate this further after my Easter vacation.

Just to let you know, this topic is not ignored. I am working on our testing framework to create a test case for this at some point in the near future.

Sorry for the long wait, but now there is framework support for these kind of tests. I created a test to reproduce the issue, but if the interrupt is injected while the guest has RFLAGS.IF=0 the VM-entry fails due to invalid guest state (Exit 33).
I haven't found the failing VM-entry check yet, but I wanted to provide an update here.

It also seems that I am still doing something different, can you point me to the place in your source code that sets up the VM-execution control registers and maybe a dump of those active when the injection is postponed?

As a side note:
While investigating the issue I found that the VMMs VM-entry interrupt information field, is cleared independently of the kernels VM-entry interrupt info field. See store_exit_info.
So the VMM always perceives the interrupt info as invalid after an VM exit.
However, we are talking about kernel state and there the interrupt info stays valid.

Thanks for taking still care. Unfortunately I can't invest any time to debug it on my side anymore. However I will provide you with the test cases. Under vm images you find the disk images you may boot on native Intel hardware (just dd to a USB stick) or use linux/kvm on a Intel machine, kvm -serial mon:stdio -m 512 format=raw,file=<image_name.img> -machine q35 -cpu host worked for me for all images.

For the images, the VM and the VMM is the same, just the kernel specific part varies (nova, sel4, Fiasco.OC). For Fiasco.OC I provide you 4 images as reference. I think the easiest way for you is just to exchange the kernel in the image yourself and instrument it accordingly. The version we use is referenced here, the used patches and the config

The 0015* patch is for the issue I reported here. The good images are with the patch, the bad images are without the patch. The printf image version additionally print a line when the if clause triggers of the patch.

else                                                                         
    {                                                                          
      // switch off event injection if requested but still pending in hw       
      if (Vmx::vmread<Mword>(Vmx::F_entry_int_info) & (1UL << 31)) {           
        printf("entry_int_info ? \n");                                         
        Vmx::vmwrite(Vmx::F_entry_int_info, irq_info);                         
      }                                                                        
    }

Interestingly, the above message shows up under linux/kvm more often than on real hardware. Hope the infos are useful to you.

(In case you want to build the whole setup yourself, something like make -C build/x86_64 KERNEL=foc run/seoul-net is your starting point - omitting all the pre-requisite for now).

Actually if you are in this situation you've hit this case and the valid bit in the hardward VMCB should be cleared there. In all other cases the valid bit in the interrupt entry information was cleared by hardware.

Thanks for bringing this to mind again. I take time in the new year to look at this. Sorry for dropping this.

I think we are looking at two issues here. The one is the emulation of an external interrupt during event injection to the VMM and the other is the kernel's F_entry_int_info field still being valid after do_resume_vcpu.

It's correct that in the case @chrehrhardt references the VM-entry interrupt info field is not invalidated after moving the data to IDT vectoring info. To my current understanding the VM-entry interrupt info field should be invalid when the IDT vectoring info is valid, given that IDT vectoring info is only valid, if there was a VM-entry and a VM-exit during the event injection.
I haven't found anything in the manual that suggests that VM-entry interrupt info remains valid beyond a VM-exit. The documentation for "VM-exit during event injection" (Intel SDM Volume 3, 26.6.1.2. ) always talks about VM-exits and the 2nd sentence about VM-exits is: The valid bit in VM-entry interrupt info is cleared (27.2). (As @alex-ab is wrote above as well).

Nevertheless, to complete the emulation of an external interrupt event during event delivery in case there are pending interrupts at the vCPU the VM-entry interrupt info should be invalidated.
Thanks to a contribution of @chrehrhardt this will be fixed soon.
A quick fix to test this would be to add write<Unsigned32>(vmcs_s, Vmx::F_entry_int_info, int_info & ~((Unsigned32)1 << 31)); to the if-branch @chrehrhardt references and see if your VMM still fails to boot. This entails that your VMM must check the IDT vectoring info field and act on it.

This (currently buggy) emulation case is the only code path where the VMM should ever perceive the VM-entry interrupt information as valid. If do_resume_vcpu is called, the VMM-visible VM-entry interrupt information is unconditionally invalidated in
store_exit_info().

Now, the "cancel event injection" mechanism strikes me as unnecessary or maybe only necessary if the kernel's F_entry_int_info filed behaves unexpected. (Note my careful wording, it's a complex topic).

This leads my to the question on why the kernel's F_entry_int_info field is still valid after do_resume_vcpu aka after a VM-exit, since - as established above - this should never happen.
Instrumenting my own build after resume_vm_vmx() did not show the VM-entry interrupt info as valid in any case. The code did not change in an relevant area in the last years.

Might it be that the first VM-entry that does not inject the event properly? I'm fishing for potential differences between VMLAUNCH and VMRESUME wrt the event injection, but that's only fishing.

I work on reviving my tests for this issue, unfortunately, they didn't age well.

With regards to the valid F_entry_int_info: I think a failing VM-entry before the injection happens leaves the field in a valid state. I suspect due to the passed time you don't recall if there was a failing entry signaled via ret when the valid bit is still set?

With the next sync of our development version to github next Monday, the fiasco repo contains a commit with Change-Id: I5ed5e53887bf28817b16dd1ed2ffa57bbd6382e4. This fixes the missing invalidation of the VM-entry interrupt information field in case the kernel emulates a VM-exit during event delivery via the IDT-vectoring information field.

IMHO this fixes the observed behavior.