VMware Fusion can no longer start virtual machines when HookCase extension is loaded
stoodend opened this issue · 16 comments
What version of Fusion are you using, and which version of OS X are you running it on? Also, what's the version of Windows in the VM that's failing to load?
For what it's worth, I can reproduce this problem with Fusion 10.1.3 (the current version) running on OS X 10.11.6. I did so with two different client VMs -- one running Windows 10 and the other running OS X 10.9. So presumably it happens with any client VM.
I'd still like the information I requested in my previous comment, though.
Happens on OS X 10.13.6 with Fusion 10.1.3, with a client VM running Windows 10. Looks like the OS of the client VM does not make a difference; it happens with any client VM.
Thanks for the information. I think I've found a clue as to what's happening. The following error appears twice in the host's /var/log/system.log each time I try (and fail) to launch a VM (among many other VMware-specific log messages). It doesn't appear when the client VM loads successfully (with HookCase unloaded).
Sep 18 12:02:46 emma vmnet-dhcpd[531]: select: Interrupted system call
Sep 18 12:02:46 emma vmnet-dhcpd[531]: exiting.
Please check your own /var/log/system.log file for similar errors, and let me know your results. I haven't yet been able to find anything else in system.log that seems specific to the HookCase error condition.
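For anyone who wants to script this check, here is a minimal sketch in Python. The names `find_dhcpd_select_errors` and `scan_log` are hypothetical, and the pattern is derived only from the two log lines quoted above, so it may miss messages that are worded differently.

```python
import re

# Pattern based only on the two sample lines quoted in this thread:
# vmnet-dhcpd reporting an interrupted select() call, or exiting.
DHCPD_ERROR = re.compile(
    r"vmnet-dhcpd\[\d+\]: (select: Interrupted system call|exiting\.)"
)


def find_dhcpd_select_errors(lines):
    """Return the log lines that look like the vmnet-dhcpd failure."""
    return [line.rstrip("\n") for line in lines if DHCPD_ERROR.search(line)]


def scan_log(path="/var/log/system.log"):
    """Scan a host log file (reading it may require admin rights)."""
    with open(path, errors="replace") as log:
        return find_dhcpd_select_errors(log)
```

Calling `scan_log()` on the host right after a failed VM launch should list any matching lines. An empty result only means that nothing matched this particular pattern.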
The vmware.log file inside the virtual machine package might have some useful info?
Maybe the way HookCase injects into the VMware processes is not compatible with the way VMware process ceremony works? Judging by how HookCase is built, is it possible to have it inject only into specific processes and not every process as it is now?
There's nothing at all in the OS X 10.9 client system log for any of the error conditions. Presumably the client OS gets killed before it starts logging to system.log.
HookCase doesn't inject code into a process unless you tell it to (via the HC_INSERT_LIBRARY environment variable). So that's not what's causing the problem. However, HookCase does at least examine every process that starts up while it's running, and something there might be causing trouble.
I'll be investigating this bug by trial and error. I'll start by finding out how much of HookCase's functionality I need to turn off to make the problem go away. With luck I'll find the "real" bug and be able to fix it. But failing that I'd be willing to stop HookCase from examining one or more or all VMware processes, if that works -- even though it'd be a terrible hack.
In the meantime you won't be able to use HookCase on a Fusion host machine. If I may ask, though, why do you need to do this? Can't you just not use Fusion while you're using HookCase?
I could, but I am looking at deploying this on a system where I don't know if the user will want to use Fusion or not. Maybe I am old, but I think that any new piece of software you bring onto a system that breaks existing software is bad, so my intention is just to understand why exactly this happens. I thought of excluding the "VMware processes" just as a temporary workaround but, like any engineer, I am actually more interested in finding out why it breaks than in sidestepping that understanding with a workaround. Thank you for your help with the detective work! Exciting!
I understand fully your sense of caution. In your place I'd also be skeptical about using software that has a serious problem whose cause you don't understand. And I share your feeling that it's important to find the true cause of this bug.
But I don't understand why you're planning to install HookCase on "user" systems. In fact I really don't think that's a good idea. You may have things you want to do with HookCase that would benefit your users. But HookCase is extremely powerful, and think of the damage one of your users could do if they found out it was running, and were able to figure out how to use it, say, to install a key logger.
HookCase is meant for debugging and reverse engineering. I don't think it should be used for other purposes. It shouldn't be just loaded and left running. The only "users" who use it should be those with admin accounts, who can use 'sudo' to load it when needed, and then unload it when they're done with it.
Actually, I want to restrict its capabilities so that it only hooks a specific 3rd party application that we use on our team. It couldn't be used to install an arbitrary library, because it would be hard coded with a specific library and would no longer read the environment variable.
Interesting. That should work, but be careful to do it right :-)
I'll be looking into this bug over the next few days. I'll let you know when I have something to test.
I have good news, I think. I've found I can make this bug go away by incrementing the range of interrupts used internally by HookCase. So instead of using the interrupts between 0x20 and 0x23, you'd use those between 0x21 and 0x24. But this implies that Fusion uses "int 0x20" for its own purposes, which is surprising and disturbing. So I need to dig further before I can consider this bug truly fixed. For example I need to find out exactly how Fusion does use "int 0x20", and if it's possible that it might also use other interrupts in "my" range.
In the meantime I want you to test the change described above on your own system(s). I just landed a patch that makes it much easier to change which interrupts HookCase uses. Now all you need to do is change the definitions at line 152 in HookCase.h (HC_INT1 through HC_INT4).
By the way, you should do a full rebuild (not just an incremental one) after these changes. HookCase.s will "compile" differently, and I find that Apple's assembler sometimes misbehaves with incremental rebuilds.
Let me know your results!
I've now dug into this as far as I (probably) can, and have found the following:
"int 0x20" is invoked in Fusion's vmmon kernel extension, in the Task_Switch() method. As best I can tell, this happens when vmmon tries to reflect an interrupt that happens in a client VM to the host machine. This "int 0x20" presumably happens in every client VM as it starts up, but I haven't been able to figure out why. However, in every case I've been able to reproduce, the invocation happens in the context of a call on the host, from user mode, to read and/or write a character device. I can't tell which device, or which user-mode program makes the call, but I suspect it's Fusion-specific. I notice that Fusion installs a "/dev/vmmon" device on the host while it's running, which is a character device. I'd bet this is the device in question, and that it's used by the host to communicate with the client, which would involve running code on the client.
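The claim that "/dev/vmmon" is a character device is easy to confirm on a host where Fusion is running. The following is a small sketch using Python's standard `os` and `stat` modules; the helper name `is_char_device` is hypothetical.

```python
import os
import stat


def is_char_device(path):
    """Return True if path exists and is a character device."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing path or permission problem: report "not a char device".
        return False
    return stat.S_ISCHR(mode)
```

Based on the observation above, `is_char_device("/dev/vmmon")` would be expected to return True while Fusion is running. This only confirms the device type. It does not show which program reads or writes the device.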
Task_Switch() can invoke any interrupt in the range 0x12, 0x14-0xff. This is presumably so it can reflect (to the host) any interrupt (in this range) that might happen in the client. Judging by my experience so far, and the evidence I've been able to gather, some interrupts in that range are never used, practically speaking. Fusion, in its use of "int 0x20", seems to violate those expectations, as does HookCase itself. But unlike HookCase, Fusion is (mostly) closed-source. So it's difficult to tell which other "unexpected" interrupts it might be possible for it to use, in a client VM, if it wants to.
If you use non-standard interrupts that you assume aren't used by anyone else, it's natural to do so in a contiguous range, as HookCase does. So if Fusion wants to use any more of them, it's likely they'd be in a range starting with 0x20. So, somewhat arbitrarily, I've decided to fix this bug by changing HookCase to use interrupts in the range 0x30-0x37 (including some extra interrupts for future expansion). HookCase will stop using interrupts in the range 0x20-0x23.
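The reasoning behind the choice of range can be sketched in a few lines of Python. This is only an illustration: the function and constant names are hypothetical, the Task_Switch() range is the one quoted above, and the only Fusion interrupt actually observed in this thread is 0x20.

```python
def task_switch_can_invoke(vector):
    """True if Task_Switch() could invoke this vector (0x12, 0x14-0xff)."""
    return vector == 0x12 or 0x14 <= vector <= 0xFF


OLD_HOOKCASE = range(0x20, 0x24)  # 0x20-0x23, the original HookCase range
WORKAROUND = range(0x21, 0x25)    # 0x21-0x24, the interim test range
NEW_HOOKCASE = range(0x30, 0x38)  # 0x30-0x37, the planned range
OBSERVED_FUSION = {0x20}          # the only vector seen in use by Fusion


def conflicts(hookcase_vectors, fusion_vectors):
    """Return the sorted vectors that both sides would use."""
    return sorted(set(hookcase_vectors) & set(fusion_vectors))
```

With these definitions, `conflicts(OLD_HOOKCASE, OBSERVED_FUSION)` returns `[32]` (that is, 0x20), while the same call with `NEW_HOOKCASE` returns an empty list. If Fusion were ever to use a second vector, 0x21, the interim range would collide again, but the new range would stay clear of anything in 0x20-0x2f. Every vector in the new range still satisfies `task_switch_can_invoke`, so the move avoids the observed conflict without proving that none is possible.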
It will be several days before I land a patch to make this change. I need to do as much testing as I can to ensure that it doesn't cause problems of its own.
For my own future reference:
If you don't call hook_thread_bootstrap_return() during setup, HookCase has no invocation of "int 0x20" aside from those that only happen via the HC_INSERT_LIBRARY environment variable. In this case you get a kernel panic when Fusion starts a client VM, triggered by a NULL-dereference in HookCase.s's kernel_trampoline() (at "call *%rax"). But the (kernel) stack trace in the crash/panic report is very informative. This is how I got most of my information for the previous comment.
VMware publishes the source code for the version of the vmmon extension that comes with its Linux distro. This actually comes bundled with the distro, but it's hard to find separately. Here's a page I used to get the source code (possibly a little out of date):
This bug should be fixed by HookCase 2.1, which I just landed.