CI flake: migrating out of a VM doesn't always ensure its VMM resources are cleaned up
Example victim run (there have been others): https://github.com/oxidecomputer/propolis/pull/646/checks?check_run_id=21668964010
All the tests in this run passed, but the source VM's VMM handle from the migration_smoke test was leaked, causing the overall run to fail.
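For context, the failure mode is a leftover kernel VMM after all tests have finished. As a rough illustration (this is an assumption about the mechanism, not the actual PHD leak check), a runner could detect this on illumos by scanning for leftover per-VM device nodes under /dev/vmm once cleanup is done:

```rust
// Sketch only: one way a test runner could detect leaked kernel VMMs after a
// run. The /dev/vmm path and the "fail the run if anything is left" policy
// are assumptions here, not the actual phd-runner implementation.
use std::fs;

fn leaked_vmms() -> std::io::Result<Vec<String>> {
    let mut leaked = Vec::new();
    // On illumos/bhyve, each active kernel VMM is expected to have a device
    // node under /dev/vmm named after the VM.
    for entry in fs::read_dir("/dev/vmm")? {
        leaked.push(entry?.file_name().to_string_lossy().into_owned());
    }
    Ok(leaked)
}

fn main() {
    match leaked_vmms() {
        Ok(v) if v.is_empty() => println!("no leaked VMMs"),
        Ok(v) => {
            // This is the situation the flake produces: every test passed,
            // but a VMM from a migration source is still alive.
            eprintln!("leaked VMMs: {v:?}");
            std::process::exit(1);
        }
        Err(e) => eprintln!("could not scan /dev/vmm: {e}"),
    }
}
```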
From the test logs we can see that Propolis returned a 404 when trying to stop this VM. This is expected from a migration source, since the VM controller gets torn down once a migration succeeds:
```
phd-runner: [VM CLEANUP - EVENT] error stopping VM to move it to Destroyed
error = Error Response: status: 404 Not Found; headers: {"content-type": "application/json", "x-request-id": "9ed33d45-46a7-4994-9d91-254dea28b142", "content-length": "84", "date": "Fri, 16 Feb 2024 20:24:02 GMT"}; value: Error { error_code: None, message: "Not Found", request_id: "9ed33d45-46a7-4994-9d91-254dea28b142" }
file = phd-tests/framework/src/test_vm/mod.rs
line = 917
path = phd_tests::migrate::smoke_test
target = phd_framework::test_vm
vm = migration_smoke
vm_id = c51ddb74-4366-4c94-8aa4-740ec0bc72b3
```
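For illustration, cleanup code can treat that 404 as benign. The following is a minimal sketch with stand-in types (VmHandle, StopError), not the actual phd-framework or propolis-client API:

```rust
// Sketch: tolerating a 404 when stopping a VM that has already migrated out.
// `VmHandle` and `StopError` are hypothetical stand-ins for the real types.
#[derive(Debug)]
enum StopError {
    NotFound,       // the server returned 404: the VM controller is gone
    Other(String),  // anything else is a real cleanup failure
}

struct VmHandle {
    name: String,
}

impl VmHandle {
    // Stand-in for the HTTP call that asks propolis-server to stop the VM.
    fn stop(&self) -> Result<(), StopError> {
        // Simulate the migration-source case from the log above.
        Err(StopError::NotFound)
    }

    fn cleanup(&self) {
        match self.stop() {
            Ok(()) => println!("{}: stopped", self.name),
            // Expected for a migration source: its VM controller was torn
            // down when the migration succeeded, so there is nothing left
            // for the server to stop.
            Err(StopError::NotFound) => {
                println!("{}: already gone (migrated out?)", self.name)
            }
            Err(e) => eprintln!("{}: error stopping VM: {e:?}", self.name),
        }
    }
}

fn main() {
    VmHandle { name: "migration_smoke".into() }.cleanup();
}
```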
This is not a production-impacting problem because in a real control plane, Propolis runs in a zone, and migrating out of a Propolis will (or should) direct the control plane to destroy the zone, which will clean up all remaining VMM resources. Still, this is an annoying flake and we should fix it.
I took a very brief look at the (presumed) reproduction of this from PR #646. There the leaked VM is c51ddb74-4366-4c94-8aa4-740ec0bc72b3 (per the test log), which appears in the PHD log to be the migration_smoke source VM. Looking through its logs I can see that its VM controller did in fact get dropped:
{"msg":"Dropping VM controller","v":0,"name":"propolis-server","level":30,"time":"2024-02-16T20:24:02.141233373Z","hostname":"buskin","pid":1077,"component":"vm_controller"}
This drop impl is supposed to take care of dropping the underlying Instance (N.B. this snippet is from the code at #646 and not at current master):
propolis/bin/propolis-server/src/lib/vm/mod.rs, lines 849 to 854 (at b2a4200)
That's supposed to forcibly drop the kernel VMM:
propolis/lib/propolis/src/instance.rs, lines 50 to 63 (at b2a4200)
(As of 19d9eb3 the Instance indirection is gone and VmController's drop impl cleans out the Machine directly.)
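Since the embedded snippets don't reproduce here, the intended chain is roughly the following. This is a minimal sketch with stand-in names (KernelVmm, the fields, the log strings), not the code at the permalinks above:

```rust
// Sketch of the teardown chain: dropping the VM controller drops the
// Instance, whose own Drop forcibly destroys the kernel VMM. All names here
// are stand-ins; the real code is at the permalinks cited above.
struct KernelVmm; // stand-in for the bhyve VMM handle / Machine

struct Instance {
    vmm: Option<KernelVmm>,
}

impl Drop for Instance {
    fn drop(&mut self) {
        // The Instance is responsible for destroying the kernel VMM whether
        // or not the guest halted cleanly.
        if self.vmm.take().is_some() {
            eprintln!("destroying kernel VMM");
            // real code calls into the kernel here to destroy the VMM
        }
    }
}

struct VmController {
    instance: Option<Instance>,
}

impl Drop for VmController {
    fn drop(&mut self) {
        // Corresponds to the "Dropping VM controller" log line above.
        eprintln!("Dropping VM controller");
        // Dropping the Instance here is what triggers the VMM teardown.
        drop(self.instance.take());
    }
}

fn main() {
    // If every reference to the controller really goes away, this should end
    // with the kernel VMM destroyed and nothing left behind in the kernel.
    drop(VmController {
        instance: Some(Instance { vmm: Some(KernelVmm) }),
    });
}
```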
The PHD runner logs also show that this Propolis process was killed (or at least that the framework definitely tried to kill it):
```
phd-runner: [VM CLEANUP - EVENT] Killing Propolis server process
file = phd-tests/framework/src/test_vm/server.rs
line = 125
path = phd_tests::migrate::smoke_test
self.address = 127.0.0.1:9000
target = phd_framework::test_vm::server
vm = migration_smoke
vm_id = c51ddb74-4366-4c94-8aa4-740ec0bc72b3
```
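For reference, "killing the server" at this point in cleanup is just terminating the child process. Here's a sketch of that step, assuming the framework holds a std::process::Child for the spawned server (an assumption; the real code is in phd-tests/framework/src/test_vm/server.rs):

```rust
// Sketch: what "Killing Propolis server process" amounts to, assuming the
// framework keeps a std::process::Child for the spawned server. Names and
// the `sleep` stand-in command are assumptions, not the real server.rs code.
use std::process::{Child, Command};

fn kill_server(server: &mut Child, vm_name: &str) {
    eprintln!("[VM CLEANUP - EVENT] Killing Propolis server process; vm = {vm_name}");
    // SIGKILL the process and reap it. Note that this gives the server no
    // chance to run destructors, which matters for the point below.
    let _ = server.kill();
    let _ = server.wait();
}

fn main() -> std::io::Result<()> {
    // Stand-in for launching propolis-server with its config and address.
    let mut child = Command::new("sleep").arg("60").spawn()?;
    kill_server(&mut child, "migration_smoke");
    Ok(())
}
```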
This is all relevant because the easiest way to leak a VMM is to kill a Propolis server without actually dropping all the references to its live VmController. That doesn't seem to have happened here, so something subtler is probably going on.
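To make that failure mode concrete, here's a sketch (stand-in types, not propolis-server code) of how a lingering reference, or a kill that never runs destructors, prevents the controller's Drop from ever destroying the kernel VMM:

```rust
// Sketch: a VmController whose Drop is never reached. If some task still
// holds a clone of the Arc when the process goes away, or the process is
// SIGKILLed outright, the kernel VMM the controller guards is never
// destroyed. Types here are stand-ins, not the real propolis-server code.
use std::sync::Arc;

struct VmController;

impl Drop for VmController {
    fn drop(&mut self) {
        // In the real server this is where the kernel VMM gets torn down.
        eprintln!("controller dropped; kernel VMM destroyed");
    }
}

fn main() {
    let controller = Arc::new(VmController);

    // Stand-in for a background task that clones the controller and never
    // releases its reference.
    let stuck_reference = Arc::clone(&controller);
    std::mem::forget(stuck_reference);

    drop(controller);
    // Nothing is printed: Drop never ran, so the VMM outlives the server
    // unless something external (e.g. zone teardown) cleans it up.
}
```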
I've seen this locally in some of my recent migration testing, but it's pretty infrequent. I'll need to set up a little more scaffolding to make sure I capture more information if I can still get this to repro.