oxidecomputer/propolis

server: failed migrations are reported as having completed before the source VM is fully resumed

Opened this issue · 3 comments

Originally seen in #809; see excerpt below.


The PHD failure here is weird. The failure is in the running_process::import_failure test, which is what I'd expect this change to break if it broke anything. But the failure mode is odd: the first migration fails as expected, but then the source VM immediately gets a stop request, which it immediately handles, rendering it unable to serve as a migration source for the second migration:

{"msg":"migration out failed, resuming","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.434800658Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","error":"RemoteError(Destination, \"failed to migrate device state: failed to apply deserialized device state: you have no chance to survive, make your time\")"}
{"msg":"publishing new instance state","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.434818339Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","migration":"InstanceMigrateStatusResponse { migration_in: None, migration_out: Some(InstanceMigrationStatus { id: d5a22278-1907-4188-beb1-dbaab29b33a3, state: Error }) }","state":"Running","gen":12}
{"msg":"state driver handled event","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.434839439Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","outcome":"Continue"}
{"msg":"request completed","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.459488929Z","hostname":"buskin","pid":1013,"uri":"/instance","method":"GET","req_id":"2b877526-cd31-4935-81e0-ab86c6740d17","remote_addr":"127.0.0.1:37727","local_addr":"127.0.0.1:9000","latency_us":173,"response_code":"200"}
{"msg":"requested state via API","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460691072Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","state":"Stop"}
{"msg":"Queuing external request","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460710913Z","hostname":"buskin","pid":1013,"component":"request_queue","component":"vm_state_driver","disposition":"Enqueue","request":"Stop"}
{"msg":"state driver handling event","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460729383Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","event":"ExternalRequest(Stop)"}
{"msg":"stopping instance","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460745983Z","hostname":"buskin","pid":1013,"component":"vm_state_driver"}
{"msg":"request completed","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460760814Z","hostname":"buskin","pid":1013,"uri":"/instance/state","method":"PUT","req_id":"a85e15e1-5993-40b9-8fed-5f7ac47337fd","remote_addr":"127.0.0.1:37727","local_addr":"127.0.0.1:9000","latency_us":142,"response_code":"204"}

We log state changes requested through the framework's VM state change functions, as well as attempts to stop a VM when it's being torn down, and I don't see anything especially suspicious preceding the VM stopping. My best guess is that there's a synchronization bug in phd_framework::Framework::wait_for_cleanup_tasks that's allowing a VM teardown task from a previous test to run and affect this one, but I haven't yet found any evidence of this.

I also can't reproduce this locally, so I've queued a rerun to see if it happens again. Could probably use some more instrumentation in the PHD VM cleanup logic in any case.

Originally posted by @gjcolombo in #809 (comment)

Looking a little more closely at the logs from the most recent failure, I believe the "VM was asked to stop too early" explanation from above is incorrect. The real problem is that the source is rejecting the second target's request to migrate out because the source believes the first attempt to migrate out hasn't resolved yet:

{"msg":"Queuing external request","v":0,"name":"propolis-server","level":30,"time":"2024-12-18T02:29:36.207901141Z","hostname":"sock","pid":879,"component":"request_queue","component":"vm_state_driver","disposition":"Deny(AlreadyMigrationSource)","request":"MigrateAsSource { migration_id: 61dd5033-b6ad-4b1b-9a52-d42539060b04 }"}
{"msg":"Error returned from handler: ForbiddenStateChange(AlreadyMigrationSource)","v":0,"name":"propolis-server","level":50,"time":"2024-12-18T02:29:36.207965401Z","hostname":"sock","pid":879,"upgrade":"websocket","uri":"/instance/migrate/61dd5033-b6ad-4b1b-9a52-d42539060b04/start","method":"GET","req_id":"6f804c45-25dc-4142-9499-e510118026d7","remote_addr":"127.0.0.1:45215","local_addr":"127.0.0.1:9000"}

The migration ID in the first message lines up with the migration IDs in the PHD logs.

The reason this can happen is that when a migration fails, Propolis publishes two separate state updates. First, the migration runner publishes that the migration failed (while leaving the VM itself in the "Migrating" state):

{"msg":"publishing new instance state","v":0,"name":"propolis-server","level":30,"time":"2024-12-18T02:29:36.203591685Z","hostname":"sock","pid":879,"component":"vm_state_driver","migration":"InstanceMigrateStatusResponse { migration_in: None, migration_out: Some(InstanceMigrationStatus { id: 09d84ec5-9c79-4bb6-be77-971d501830e2, state: Error }) }","state":"Migrating","gen":11}

Then it resumes the VM's devices and moves it back to "Running":

{"msg":"migration out failed, resuming","v":0,"name":"propolis-server","level":30,"time":"2024-12-18T02:29:36.208712193Z","hostname":"sock","pid":879,"component":"vm_state_driver","error":"RemoteError(Destination, \"failed to migrate device state: failed to apply deserialized device state: you have no chance to survive, make your time\")"}
{"msg":"publishing new instance state","v":0,"name":"propolis-server","level":30,"time":"2024-12-18T02:29:36.208727293Z","hostname":"sock","pid":879,"component":"vm_state_driver","migration":"InstanceMigrateStatusResponse { migration_in: None, migration_out: Some(InstanceMigrationStatus { id: 09d84ec5-9c79-4bb6-be77-971d501830e2, state: Error }) }","state":"Running","gen":12}

A new migration is only allowed to start after the second transition, but phd_framework::test_vm::TestVm::migrate_from only waits for the first one.

The control plane behaves similarly to PHD here (the instance update saga will retire a migration attempt and clear the instance's migration ID as soon as the migration itself is in a terminal state, even if the source instance hasn't moved back to the "running" state yet). That makes me lean a bit toward trying to fix this in Propolis (i.e. only publish the final transition to the "Error" migration state once the source VM has resumed and is ready to go back to "Running").

Note that this probably has very little impact on a production deployment:

  • Many, many more things need to happen in the "migration has failed but the source hasn't resumed yet" window before Omicron will decide to let another migration attempt occur, so Nexus is much less likely to order up a new migration for an instance while a previous failed migration is in limbo
  • If the problem does occur, Nexus can just wait a bit and then try to migrate again

Nevertheless, this flake is annoying and needs to be fixed.