tandasat/SimpleSvm

system hangs shortly after virtualizing processors

NickTsl opened this issue · 9 comments

Hello, thank you for this simple repository; it is really helping me learn about virtualization on AMD processors. I downloaded your hypervisor and recoded many parts to gain a better understanding of how things work. Right now, I am having an issue where the entire system hangs a few seconds after VMRUN. I found that it always hangs at a PAUSE instruction after sending an IPI request (KiIpiSendRequest) in an ntoskrnl function. I have attached two images below, showing exactly where it hangs inside ntoskrnl.

image

image

What could possibly be the reason behind this?

Thanks in advance.

It appears that the IPI is not being processed by one or more processors.

A few questions:

  • Is this happening with unmodified SimpleSvm or with your modified code? If the former, can you share the details of the target system so I can attempt to reproduce it if needed?
  • Are you perhaps calling any NT-exposed APIs while handling #VMEXIT? If so, stop doing that and see if it makes a difference. If two processors are handling #VMEXIT concurrently and both issue IPIs, that causes a deadlock, because I do not think IPIs are delivered to a core while it is handling #VMEXIT.

  1. This happens with my modified code.
  2. The hang happens in guest state, outside of the VMEXIT handler, so that shouldn't be the problem. In any case, I am not calling any APIs in VMEXIT.

Is it possible that it has something to do with incorrectly setting up IDTR and segment attributes?

Diagnosing an issue without code is not going to be easy for me. Instead, here are several debugging tips so that you can hopefully figure it out yourself.

  • Disable as many #VMEXIT intercepts as possible, then re-enable them one by one to see where the problem is.
  • Where doable, port parts of your code to SimpleSvm to test whether that causes the issue.
  • If the issue happens on bare metal but not on VMware, suspect memory-related issues such as TLB flushing or memory attribute configuration in NPT. Try without enabling NPT.
  • Does this happen on a single-processor system, or on multi-core only? If it happens on a multi-core system, are all processors stuck at the PAUSE instruction?
  • Consider logging the reasons for #VMEXIT in memory so you can inspect the contents when the system hangs. That lets you know what the VMM intercepted before the hang.
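
The last tip can be as simple as a fixed-size ring buffer that the #VMEXIT handler fills with plain memory writes. Below is a minimal sketch, assuming one log per processor (so no locking is needed); all names (`ExitLog`, `ExitLogRecord`, `ExitLogPeek`) are hypothetical and are not part of SimpleSvm.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; none of this is taken from SimpleSvm. */
#define EXIT_LOG_CAPACITY 256u /* must be a power of two */

typedef struct ExitLogEntry {
    uint64_t ExitCode;  /* VMCB EXITCODE */
    uint64_t ExitInfo1; /* VMCB EXITINFO1 */
    uint64_t ExitInfo2; /* VMCB EXITINFO2 */
    uint64_t GuestRip;  /* guest RIP at the time of the exit */
} ExitLogEntry;

typedef struct ExitLog {
    uint64_t     NextIndex; /* total number of entries ever written */
    ExitLogEntry Entries[EXIT_LOG_CAPACITY];
} ExitLog;

/* Record one exit using only plain memory writes (no NT API calls).
 * Assumes one ExitLog per processor, so no synchronization is done. */
static void ExitLogRecord(ExitLog *Log, uint64_t ExitCode,
                          uint64_t ExitInfo1, uint64_t ExitInfo2,
                          uint64_t GuestRip)
{
    ExitLogEntry *entry =
        &Log->Entries[Log->NextIndex & (EXIT_LOG_CAPACITY - 1u)];

    entry->ExitCode = ExitCode;
    entry->ExitInfo1 = ExitInfo1;
    entry->ExitInfo2 = ExitInfo2;
    entry->GuestRip = GuestRip;
    Log->NextIndex++;
}

/* Return the entry written `Age` exits ago (0 = most recent),
 * or NULL if that entry was never written or has been overwritten. */
static const ExitLogEntry *ExitLogPeek(const ExitLog *Log, uint64_t Age)
{
    uint64_t stored = (Log->NextIndex < EXIT_LOG_CAPACITY)
                          ? Log->NextIndex
                          : EXIT_LOG_CAPACITY;

    if (Age >= stored) {
        return NULL;
    }
    return &Log->Entries[(Log->NextIndex - 1u - Age) &
                         (EXIT_LOG_CAPACITY - 1u)];
}
```

When the system hangs, the buffer can be dumped from the debugger, and the newest entries show which exits the processor took just before it stopped making progress.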

If critical structures like the IDTR are set up incorrectly, that could cause a problem like this, but since the system works for a few seconds, I do not expect that to be the cause. A mistake there would more likely cause problems immediately, unless it is very subtle.

Thank you for the detailed response, I will keep you updated on my testing when I get home.

I haven't fixed the problem yet, but I managed to get more information about the error. I also ported some code from your code to my fork of your repository, but that still resulted in the same problem. The freeze happens regardless of whether the system is single-core or multi-core. When I restrict my virtual machine to one core, I can't break into the virtual machine at all when it hangs, so I am testing with 2 cores. I switched to the second core to view its registers and call stack, and the call stack of the second core looks pretty interesting.

core 1 callstack:
image

core 2 callstack:
image

I'll upload my code below as a ZIP. If you could take a look at it, I would greatly appreciate it. Thanks in advance.

MyFirstHypervisor.zip

I compile with: x64, Debug

Hi, thank you for sharing the code and more details. Please try the following:

  • Remove the DbgPrint calls made within (and under) HandleVmExit. DbgPrint is a non-trivial NT API that can likely trigger an IPI itself.
  • If you have VMware Workstation Pro, Player, or Fusion, set up gdb-stub debugging and find out how the frozen processor got into this situation by inspecting RSP and looking at the return addresses on the stack. The detailed instructions for setting it up are found in this post.

I removed all DbgPrint calls, with no change in the result. It appears that the issue doesn't have anything to do with interrupts: I inspected core 2 and found that it is constantly taking a guest page fault in an infinite loop.

The page fault address printed in the picture below is taken from EXITINFO2.
image

The first core is executing normally without any problem.
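
For anyone chasing a similar loop, here is a rough sketch of how the fault information could be decoded in a #VMEXIT handler. It assumes, per my reading of the AMD manual, that for both an intercepted #PF and a nested page fault EXITINFO1 carries the page-fault error code in its low bits and EXITINFO2 carries the faulting address (guest-virtual for #PF, guest-physical for a nested page fault). The names `PageFaultInfo` and `DecodePageFault` are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type; not part of SimpleSvm. */
typedef struct PageFaultInfo {
    bool     Present;     /* bit 0: set = protection violation, clear = page not present */
    bool     Write;       /* bit 1: set = write access, clear = read access */
    bool     User;        /* bit 2: set = access made in user mode */
    bool     ReservedBit; /* bit 3: set = reserved bit found set in a page-table entry */
    bool     InstrFetch;  /* bit 4: set = fault on an instruction fetch */
    uint64_t Address;     /* faulting address taken from EXITINFO2 */
} PageFaultInfo;

/* Split the low five bits of the error code into named flags. */
static PageFaultInfo DecodePageFault(uint64_t ExitInfo1, uint64_t ExitInfo2)
{
    PageFaultInfo info;

    info.Present     = (ExitInfo1 & (1ull << 0)) != 0;
    info.Write       = (ExitInfo1 & (1ull << 1)) != 0;
    info.User        = (ExitInfo1 & (1ull << 2)) != 0;
    info.ReservedBit = (ExitInfo1 & (1ull << 3)) != 0;
    info.InstrFetch  = (ExitInfo1 & (1ull << 4)) != 0;
    info.Address     = ExitInfo2;
    return info;
}
```

If the same address and the same flags repeat on every exit, the handler is most likely resuming the guest without resolving the cause of the fault.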

EDIT: I fixed this page fault loop, and I might have fixed the freezing problem too. I'll post updates soon; I think I'm on the right track to debug this now.

I finally fixed the freezing issue. It turns out it was caused by me putting breakpoints in the VMEXIT handler; for some reason, those breakpoints caused a hang.
image

Thank you for spending time on fixing this with me. I will close this issue now 👍

That's interesting and new to me (at least outside of certain #VMEXITs). Thank you for sharing this gotcha!