dergachev/vagrant-vbox-snapshot

Intermittent failures with VBoxManage commands seemingly targeted at the wrong VM

Closed this issue · 3 comments

I'm seeing intermittent errors with snapshot restores where a VBoxManage error is thrown because the VM can't be found. However, on checking the output, the UUID in the error doesn't seem to match the UUID reported for the machine when it is powered down.

To work around this in my automation code, I restart the VM and restore the snapshot again, hoping that the intermittent error won't occur twice in a row.
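
Roughly, the retry looks like this (a simplified shell sketch of the automation logic; the real code drives these commands over SSH and checks the exit status):

# Simplified sketch of the workaround: if the restore fails, bring the VM
# back up and retry the restore once, hoping the error doesn't repeat.
if ! vagrant snapshot go minion1 postAddWorkspace; then
  vagrant up minion1 --provision
  vagrant snapshot go minion1 postAddWorkspace
fi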

You can see the output from my code below, including the output from the failed snapshot restore with the seemingly incorrect UUID.

Any ideas what's going on?

07:16:37.904 [main] INFO c.p.t.t.execute.remote.SSHCommand - Executing command [vagrant snapshot go minion1 postAddWorkspace] on [qa-dcs-pm]
07:16:37.904 [main] DEBUG c.p.t.t.execute.remote.SSHCommand - Working directory: /home/smt/vagrant/ecagent-smt-ubuntu-regression-main-1006545
Powering off machine 8a2755ce-89fd-4569-9785-33cde8e93a06
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Restoring snapshot fdd54252-2591-427e-85e4-8a6df7dcf63a
Starting restored VM
==> minion1: Checking if box 'hashicorp/precise64' is up to date...
There was an error while executing VBoxManage, a CLI used by Vagrant
for controlling VirtualBox. The command and stderr is shown below.

Command: ["showvminfo", "a9939272-5b15-43dc-8441-44197756f6b8", "--machinereadable"]

Stderr: VBoxManage: error: Could not find a registered machine with UUID {a9939272-5b15-43dc-8441-44197756f6b8}
VBoxManage: error: Details: code VBOX_E_OBJECT_NOT_FOUND (0x80bb0001), component VirtualBox, interface IVirtualBox, callee nsISupports
VBoxManage: error: Context: "FindMachine(Bstr(VMNameOrUuid).raw(), machine.asOutParam())" at line 2611 of file VBoxManageInfo.cpp

07:16:49.818 [main] INFO c.p.t.t.execute.remote.SSHCommand - Command [vagrant snapshot go minion1 postAddWorkspace] : Failure
07:16:51.704 [main] INFO c.p.t.t.execute.remote.SSHCommand - Command [vagrant status minion1 --machine-readable] : Success
07:16:51.705 [main] DEBUG c.p.t.t.c.o.MachineReadableOutput - Output state for minion1: saved
07:16:53.611 [main] INFO c.p.t.t.execute.remote.SSHCommand - Command [vagrant status minion1 --machine-readable] : Success
07:16:53.611 [main] DEBUG c.p.t.t.c.o.MachineReadableOutput - Output state for minion1: saved
07:16:53.612 [main] INFO c.p.t.t.execute.remote.SSHCommand - Executing command [vagrant up minion1 --provision] on [qa-dcs-pm]
07:16:53.612 [main] DEBUG c.p.t.t.execute.remote.SSHCommand - Working directory: /home/smt/vagrant/ecagent-smt-ubuntu-regression-main-1006545
Bringing machine 'minion1' up with 'virtualbox' provider...
==> minion1: Checking if box 'hashicorp/precise64' is up to date...
==> minion1: Resuming suspended VM...
==> minion1: Booting VM...
==> minion1: Waiting for machine to boot. This may take a few minutes...
minion1: SSH address: 127.0.0.1:2216
minion1: SSH username: vagrant
minion1: SSH auth method: private key
==> minion1: Machine booted and ready!
07:17:14.230 [main] INFO c.p.t.t.execute.remote.SSHCommand - Command [vagrant up minion1 --provision] : Success
07:17:15.022 [main] INFO c.p.t.t.execute.remote.SSHCommand - Command [vagrant plugin list] : Success
07:17:15.023 [main] INFO c.p.t.t.execute.remote.SSHCommand - Executing command [vagrant snapshot go minion1 postAddWorkspace] on [qa-dcs-pm]
07:17:15.023 [main] DEBUG c.p.t.t.execute.remote.SSHCommand - Working directory: /home/smt/vagrant/ecagent-smt-ubuntu-regression-main-1006545
Powering off machine 8a2755ce-89fd-4569-9785-33cde8e93a06
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Restoring snapshot fdd54252-2591-427e-85e4-8a6df7dcf63a
Starting restored VM
==> minion1: Checking if box 'hashicorp/precise64' is up to date...
==> minion1: Resuming suspended VM...
==> minion1: Booting VM...
==> minion1: Waiting for machine to boot. This may take a few minutes...
minion1: SSH address: 127.0.0.1:2216
minion1: SSH username: vagrant
minion1: SSH auth method: private key
==> minion1: Machine booted and ready!

Hi @davebirch, thanks for trying vagrant-vbox-snapshot.

We did some digging around, and noticed the following:

vagrant -v  #    Vagrant 1.6.3

# let's get a consistent test environment
git clone https://github.com/dergachev/vagrant-vbox-snapshot
cd vagrant-vbox-snapshot

# note we'll play with this default vagrantfile
vim Vagrantfile      # set TEST_MULTI_VM = true

vagrant up --provider=virtualbox

# now in a multi-VM setup, so 'db' is one of the machine names
vagrant snapshot take db snap01     

vagrant snapshot list    # looks good

# now let's run the restore, with verbose logging
VAGRANT_LOG=info vagrant snapshot go snap01

I recommend that you replicate the above steps and figure out what's supposed to happen. Then enable verbose logging in your environment and see what's different.

When I ran them, I noticed the following interesting snippet in the (very verbose) output of the last command:

INFO warden: Calling OUT action: #<VagrantPlugins::ProviderVirtualBox::Action::Created:0x0000010292d800>
INFO runner: Preparing hooks for middleware sequence...
INFO runner: 4 hooks defined.
INFO runner: Running action: #<Vagrant::Action::Warden:0x00000100eae7b8>
INFO warden: Calling IN action: #<Proc:0x00000100df4548@/Applications/Vagrant/embedded/gems/gems/vagrant-1.6.3/lib/vagrant/action/warden.rb:94 (lambda)>
INFO warden: Calling IN action: #<VagrantPlugins::ProviderVirtualBox::Action::CheckAccessible:0x00000100eae718>
INFO subprocess: Starting process: ["/usr/bin/VBoxManage", "showvminfo", "d3199f90-02de-4621-b16a-6058a8d5d7a8", "--machinereadable"]
INFO warden: Calling IN action: #<Vagrant::Action::Builtin::EnvSet:0x00000100eae6f0>
INFO warden: Calling IN action: #<VagrantPlugins::ProviderVirtualBox::Action::PrepareForwardedPortCollisionParams:0x00000100eae6c8>
INFO subprocess: Starting process: ["/usr/bin/VBoxManage", "list", "vms"]
INFO subprocess: Starting process: ["/usr/bin/VBoxManage", "showvminfo", "39cc94cd-19ed-4fc6-9ba7-4dbce4251ae7", "--machinereadable"]
INFO subprocess: Starting process: ["/usr/bin/VBoxManage", "showvminfo", "f4ac65f6-dee3-40c2-826f-d371add157b7", "--machinereadable"]
INFO subprocess: Starting process: ["/usr/bin/VBoxManage", "showvminfo", "d1f8af56-0ce3-4be3-98b2-c0bd5bc5b741", "--machinereadable"]
INFO subprocess: Starting process: ["/usr/bin/VBoxManage", "showvminfo", "ca9d4f8d-5efe-46d4-90bf-6fa30631a5ba", "--machinereadable"]
INFO subprocess: Starting process: ["/usr/bin/VBoxManage", "showvminfo", "efce9183-f4ca-4dd3-8689-c302cca2f286", "--machinereadable"]
INFO subprocess: Starting process: ["/usr/bin/VBoxManage", "showvminfo", "6389ea48-f041-4108-9674-8001cd3e0239", "--machinereadable"]
INFO subprocess: Starting process: ["/usr/bin/VBoxManage", "showvminfo", "69a5755e-d59a-4d5d-ac62-6eeb9a93821c", "--machinereadable"]
INFO subprocess: Starting process: ["/usr/bin/VBoxManage", "showvminfo", "d3199f90-02de-4621-b16a-6058a8d5d7a8", "--machinereadable"]
INFO warden: Calling IN action: #<Vagrant::Action::Builtin::HandleForwardedPortCollisions:0x00000100eae6a0>
INFO environment: Acquired process lock: dotlock

It suggests that when vagrant-vbox-snapshot calls vagrant up on the restored snapshot, Vagrant checks for potential port forwarding collisions with other VMs. I'm not sure if this is where your error gets triggered, but perhaps you've created a race condition: the list of VM IDs is populated at one moment, and while Vagrant is iterating over them, one of the IDs becomes invalid. I noticed that in other issues you opened (vagrant, vagrant-multiprovider-snap) you talk about running vagrant commands in parallel. It's not clear how well tested Vagrant is for this use case, and maybe the error comes from that.
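
To make the race window concrete, the check is roughly equivalent to this shell sketch (my approximation, not Vagrant's actual code):

# Grab the list of registered VM UUIDs first, then query each one. If another
# process runs `vagrant destroy` between the two steps, showvminfo fails with
# VBOX_E_OBJECT_NOT_FOUND for the now-unregistered UUID.
for uuid in $(VBoxManage list vms | grep -o '{[^}]*}' | tr -d '{}'); do
  VBoxManage showvminfo "$uuid" --machinereadable > /dev/null || \
    echo "VM $uuid disappeared between 'list vms' and 'showvminfo'"
done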

Hopefully this helps you to start debugging. I'd be happy to continue helping you on this, but for me to have a good chance I'd need precise steps to reproduce the error on my machine.

Thanks for the reply and the log output. I hadn't realised Vagrant's standard output wasn't already at info level, so I had jumped straight to debug-level logging and must have missed that section in the even more verbose output!

This does explain a lot. I run a lot of Vagrant instances on a single beefy box, so if Vagrant runs a VBoxManage list vms and then a showvminfo against each of the results, there is a very good chance that another instance has had a vagrant destroy run against it in the meantime (by a different test run completing) and that the machine indeed no longer exists.

The odd thing about the port forwarding collision check is that I've also seen snapshot restores fail because the requested port is already in use. I can only assume this is a similar race condition, with the machine whose snapshot is being restored and a newly started instance both trying to grab the same port.
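
For what it's worth, here's a quick shell sketch I could run on the host to dump the forwarded-port rules each registered VM currently claims, which should show whether two instances have ended up on the same host port (just my own diagnostic, not something Vagrant runs):

# List each registered VM's UUID together with its NAT port forwarding rules,
# e.g. Forwarding(0)="ssh,tcp,127.0.0.1,2216,,22".
VBoxManage list vms | grep -o '{[^}]*}' | tr -d '{}' | while read -r uuid; do
  VBoxManage showvminfo "$uuid" --machinereadable 2>/dev/null \
    | grep '^Forwarding' | sed "s/^/$uuid: /"
done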

I'll do some more checking, but if my assumptions are correct I'm not sure I'll be able to get a working reproduction scenario, given the very tight timings required to trigger this issue. To give you an idea, our current test environment can have up to 8 actively running Vagrant instances at a time, with up to 5 VMs per instance, each restoring a full set of snapshots every 1-10 minutes over roughly an hour of test run, and even then this only happens intermittently. I might be able to put together an automated script to brute-force the error condition by repeatedly running vagrant up and vagrant destroy against a couple of instances until one fails, but that still won't be great for debugging.
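
The brute-force script would be something along these lines (an untested sketch; the two directory paths are placeholders for separate Vagrant project directories, and the machine/snapshot names would need adjusting):

# Untested sketch: instance A keeps restoring a snapshot while instance B
# churns through up/destroy cycles, until the restore in A hits the race.
VM_A_DIR=/path/to/instance-a   # placeholder paths
VM_B_DIR=/path/to/instance-b

( while true; do
    (cd "$VM_B_DIR" && vagrant up && vagrant destroy -f)
  done ) &
CHURN_PID=$!

while (cd "$VM_A_DIR" && vagrant snapshot go minion1 postAddWorkspace); do
  :   # keep restoring until a restore fails
done

kill "$CHURN_PID"   # a vagrant command already in flight may need cleaning up by hand
echo "Snapshot restore failed - check the VAGRANT_LOG=info output for the race"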

I'll have a think. Either way, it seems like my issue is caused by the core vagrant up command, so it might be worthwhile for me to gather some more information and raise the issue upstream for the Vagrant team to have a look at?

Thanks again for taking the time to look at this issue, it's most appreciated.

Interesting. Yeah, I guess because Vagrant is open source, you can probably instrument its source code fairly easily to detect and confirm exactly what the race condition is. I'm sure any easily reproducible bug reports would be helpful to the maintainers.

That said, I'm surprised you're using the virtualbox backend for this use case. Maybe containers (lxc, docker, etc) could be a more robust way of dealing with it? Also I guess you'd want to disable the port forwarding feature somehow and just rely on accessing the VM IPs directly, in case that's the cause of your problem. Good luck!