oxidecomputer/propolis

CI flake: migrating out of a VM doesn't always ensure its VMM resources are cleaned up

Example victim run (there have been others): https://github.com/oxidecomputer/propolis/pull/646/checks?check_run_id=21668964010

All the tests in this run passed, but the source VM in the migration_smoke test leaked its VMM handle, which caused the overall run to fail.

From the test logs we can see that Propolis returned a 404 when the test framework tried to stop this VM. This is expected for a migration source, since the VM controller is torn down once a migration out succeeds (a sketch of that control flow follows the log excerpt):

phd-runner: [VM CLEANUP - EVENT] error stopping VM to move it to Destroyed
    error = Error Response: status: 404 Not Found; headers: {"content-type": "application/json", "x-request-id": "9ed33d45-46a7-4994-9d91-254dea28b142", "content-length": "84", "date": "Fri, 16 Feb 2024 20:24:02 GMT"}; value: Error { error_code: None, message: "Not Found", request_id: "9ed33d45-46a7-4994-9d91-254dea28b142" }
    file = phd-tests/framework/src/test_vm/mod.rs
    line = 917
    path = phd_tests::migrate::smoke_test
    target = phd_framework::test_vm
    vm = migration_smoke
    vm_id = c51ddb74-4366-4c94-8aa4-740ec0bc72b3
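
To illustrate that control flow, here's a minimal sketch; ServerContext, VmController, ApiError, and put_instance_state_stop are hypothetical stand-ins, not the actual propolis-server types or handlers:

use std::sync::{Arc, Mutex};

// Hypothetical stand-ins: this only sketches the control flow that produces
// the 404 in the log above.
struct VmController;

impl VmController {
    fn request_stop(&self) {
        // In the real server this would queue a stop request for the VM.
    }
}

struct ServerContext {
    // Cleared when a migration out succeeds and the controller is torn down.
    vm: Mutex<Option<Arc<VmController>>>,
}

// Stands in for the HTTP error the client sees as "404 Not Found".
#[derive(Debug)]
enum ApiError {
    NotFound,
}

fn put_instance_state_stop(ctx: &ServerContext) -> Result<(), ApiError> {
    let vm = ctx
        .vm
        .lock()
        .unwrap()
        .as_ref()
        .cloned()
        // Migration source after a successful migration out: the controller
        // is already gone, so there is nothing left to stop.
        .ok_or(ApiError::NotFound)?;
    vm.request_stop();
    Ok(())
}

fn main() {
    // After a successful migration out the controller slot is empty, so a
    // stop request yields the equivalent of the 404 above.
    let ctx = ServerContext { vm: Mutex::new(None) };
    assert!(matches!(put_instance_state_stop(&ctx), Err(ApiError::NotFound)));
}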

This is not a production-impacting problem because in a real control plane, Propolis runs in a zone, and migrating out of a Propolis will (or should) direct the control plane to destroy the zone, which will clean up all remaining VMM resources. Still, this is an annoying flake and we should fix it.

I took a very brief look at the (presumed) reproduction of this from PR #646. There the leaked VM is c51ddb74-4366-4c94-8aa4-740ec0bc72b3 (per the test log), which appears in the PHD log to be the migration_smoke source VM. Looking through its logs I can see that its VM controller did in fact get dropped:

{"msg":"Dropping VM controller","v":0,"name":"propolis-server","level":30,"time":"2024-02-16T20:24:02.141233373Z","hostname":"buskin","pid":1077,"component":"vm_controller"}

This drop impl is supposed to take care of dropping the underlying Instance (N.B. this snippet is from the code at #646 and not at current master):

let instance = self
    .vm_objects
    .instance
    .take()
    .expect("VM controller should have an instance at drop");
drop(instance);

That's supposed to forcibly drop the kernel VMM:

impl Drop for Instance {
    fn drop(&mut self) {
        // Make sure the inventory is cleared, even if the consumer chose not to
        // call Instance::destroy().
        self.0.get_mut().unwrap().inventory.clear();
        // If Instance::destroy() was not called prior to drop, let's cleanup
        // the underlying VMM resource.
        if let Some(machine) = self.0.get_mut().unwrap().machine.take() {
            let hdl = machine.destroy();
            let _ = hdl.destroy();
        }
    }
}

(As of 19d9eb3 the Instance indirection is gone and VmController's drop impl cleans out the Machine directly.)
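
For comparison, here's a minimal sketch of that consolidated shape; it's an illustration of the description above (reusing the names from the #646 snippet), not the actual code at 19d9eb3:

// Illustrative only: VmController owns the Machine directly, and its Drop
// impl performs the same two-step teardown the Instance drop did above.
struct VmmHdl;

impl VmmHdl {
    fn destroy(self) -> std::io::Result<()> {
        // In real Propolis this tears down the kernel VMM object.
        Ok(())
    }
}

struct Machine {
    hdl: VmmHdl,
}

impl Machine {
    fn destroy(self) -> VmmHdl {
        // Tear down guest-facing machine state, then hand back the kernel
        // VMM handle so the caller can destroy it too.
        self.hdl
    }
}

struct VmController {
    machine: Option<Machine>,
}

impl Drop for VmController {
    fn drop(&mut self) {
        // If the controller is dropped without an explicit teardown, clean up
        // the kernel VMM here, just like the Instance drop impl above.
        if let Some(machine) = self.machine.take() {
            let hdl = machine.destroy();
            let _ = hdl.destroy();
        }
    }
}

fn main() {
    // Dropping the controller runs the teardown path above.
    let vm = VmController { machine: Some(Machine { hdl: VmmHdl }) };
    drop(vm);
}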

The PHD runner logs also show that this Propolis process was killed (or at least that the framework definitely tried to kill it):

phd-runner: [VM CLEANUP - EVENT] Killing Propolis server process
    file = phd-tests/framework/src/test_vm/server.rs
    line = 125
    path = phd_tests::migrate::smoke_test
    self.address = 127.0.0.1:9000
    target = phd_framework::test_vm::server
    vm = migration_smoke
    vm_id = c51ddb74-4366-4c94-8aa4-740ec0bc72b3

This is all relevant because the easiest way to leak a VMM is to kill a Propolis server without first dropping all the references to its living VmController. That doesn't seem to have happened here, so something subtler is probably going on.
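
To make that leak mode concrete, here's a minimal, self-contained sketch; FakeVmController is hypothetical, standing in for the real controller whose Drop impl destroys the kernel VMM:

use std::sync::Arc;

// Hypothetical illustration: the kernel VMM is only destroyed from the
// controller's Drop impl, so a lingering Arc clone means Drop never runs,
// and killing the process afterward leaks the kernel object (which does not
// die with the process).
struct FakeVmController {
    name: &'static str,
}

impl Drop for FakeVmController {
    fn drop(&mut self) {
        // In real Propolis this is where machine.destroy()/hdl.destroy() run.
        println!("destroying kernel VMM for {}", self.name);
    }
}

fn main() {
    let vm = Arc::new(FakeVmController { name: "migration_smoke" });

    // Some task or handler still holds a clone of the controller.
    let lingering_ref = Arc::clone(&vm);

    // The "main" reference goes away, but Drop does not run because
    // `lingering_ref` keeps the strong count above zero.
    drop(vm);
    assert_eq!(Arc::strong_count(&lingering_ref), 1);

    // If the process were killed here (as the PHD cleanup log shows), the
    // Drop impl would never execute and the VMM would be leaked.
    std::mem::forget(lingering_ref);
    println!("exiting without ever destroying the VMM");
}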

I've seen this locally in some of my recent migration testing, but it's pretty infrequent. I'll need to set up a little more scaffolding to make sure I capture more information if I can get this to reproduce again.