saga recovery stops if Nexus fails to recover a saga
Closed this issue · 1 comment
@askfongjojo was testing how the system behaves after a Nexus failure and reported:
I notice that some sagas aren't resumed when nexus came back after a period of unavailability (the way I tested this was by halting a nexus zone and then booting it up after 20-30s - all of these happened while running terraform destroy on a group of instances). Here is an example of a "left behind" saga on rack2 - id: 8f9a2acd-d086-455d-9d58-527fb5e61473. It stays in "running" state indefinitely.
So the symptom is that a saga is hung after a Nexus restart.
Initial questions I had:
- Which Nexus is responsible for running this saga?
- Did it successfully recover the saga (and start running it again)? (If so, it could be a bug in the specific saga itself. But if not, there's something wrong with saga recovery.)
Finding Nexus
I'm including this for completeness but this section doesn't cover anything new about this bug -- just how I got to the right Nexus log.
I didn't see any commands for looking at sagas in omdb (this is #4157 / #4378) so I figured I'd query the database directly. To do that I needed to find the CockroachDB nodes and the PostgreSQL URL. The PostgreSQL URL is spit out by any successful omdb db invocation, like:
# omdb db sleds
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
...
To find the CockroachDB nodes, I did:
# omdb nexus blueprints show current | grep cockroach
cockroach_db 3237a532-acaa-4ebe-bf11-dde794fea739 in service fd00:1122:3344:109::3
cockroach_db 8bbea076-ff60-4330-8302-383e18140ef3 in service fd00:1122:3344:10b::3
cockroach_db e86845b5-eabd-49f5-9a10-6dfef9066209 in service fd00:1122:3344:108::3
cockroach_db a3628a56-6f85-43b5-be50-71d8f0e04877 in service fd00:1122:3344:107::3
cockroach_db 4c3ef132-ec83-4b1b-9574-7c7d3035f9e9 in service fd00:1122:3344:105::3
I matched up the IP prefixes with the sleds' IP prefixes, which are printed by omdb db sleds:
# omdb db sleds
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (77.0.0)
note: listing all commissioned sleds (use -F to filter, e.g. -F in-service)
SERIAL IP ROLE POLICY STATE ID
BRM42220057 [fd00:1122:3344:104::1]:12345 - in service active 0c7011f7-a4bf-4daf-90cc-1c2410103300
BRM23230018 [fd00:1122:3344:124::1]:12345 - not provisionable active 1efda86b-caef-489f-9792-589d7677e59a
BRM44220011 [fd00:1122:3344:103::1]:12345 - in service active 2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa
BRM42220017 [fd00:1122:3344:109::1]:12345 - in service active 5f6720b8-8a31-45f8-8c94-8e699218f28b
BRM42220051 [fd00:1122:3344:10b::1]:12345 scrimlet in service active 71def415-55ad-46b4-ba88-3ca55d7fb287
BRM44220010 [fd00:1122:3344:101::1]:12345 - in service active 7b862eb6-7f50-4c2f-b9a6-0d12ac913d3c
BRM42220014 [fd00:1122:3344:108::1]:12345 scrimlet in service active 87c2c4fc-b0c7-4fef-a305-78f0ed265bbc
BRM42220031 [fd00:1122:3344:102::1]:12345 - in service active a2adea92-b56e-44fc-8a0d-7d63b5fd3b93
BRM42220006 [fd00:1122:3344:106::1]:12345 - in service active b886b58a-1e3f-4be1-b9f2-0c2e66c6bc88
BRM27230045 [fd00:1122:3344:123::1]:12345 - not provisionable active bd96ef7c-4941-4729-b6f7-5f47feecbc4b
BRM42220009 [fd00:1122:3344:107::1]:12345 - in service active db183874-65b5-4263-a1c1-ddb2737ae0e9
BRM42220016 [fd00:1122:3344:10a::1]:12345 - in service active dd83e75a-1edf-4aa1-89a0-cd8b2091a7cd
BRM44220005 [fd00:1122:3344:105::1]:12345 - in service active f15774c1-b8e5-434f-a493-ec43f96cba06
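The prefix matching I did by hand can be sketched in code. This is a hypothetical helper, not anything in omicron: it pairs a zone's underlay IPv6 address with a sled by comparing the first four hextets as strings, which works here because the omdb output writes that prefix uncompressed.

```rust
// Hypothetical sketch of the by-hand matching above: a zone address like
// fd00:1122:3344:109::3 belongs to the sled whose underlay address shares
// the fd00:1122:3344:109 prefix. This compares the first four hextets as
// strings and assumes the prefix is written uncompressed, as omdb prints it.
fn subnet_prefix(addr: &str) -> String {
    addr.split(':').take(4).collect::<Vec<_>>().join(":")
}

fn find_sled<'a>(zone_ip: &str, sleds: &[(&'a str, &str)]) -> Option<&'a str> {
    sleds
        .iter()
        .find(|(_, sled_ip)| subnet_prefix(sled_ip) == subnet_prefix(zone_ip))
        .map(|(serial, _)| *serial)
}

fn main() {
    // A few (serial, underlay IP) pairs from the omdb db sleds output above.
    let sleds = [
        ("BRM42220057", "fd00:1122:3344:104::1"),
        ("BRM42220017", "fd00:1122:3344:109::1"),
        ("BRM42220014", "fd00:1122:3344:108::1"),
    ];
    // The first CockroachDB zone (fd00:1122:3344:109::3) lands on BRM42220017.
    assert_eq!(find_sled("fd00:1122:3344:109::3", &sleds), Some("BRM42220017"));
}
```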
I picked the first CockroachDB zone, saw that its IP prefix matched BRM42220017's, and logged in:
pilot host login BRM42220017
zlogin oxz_cockroachdb_3237a532-acaa-4ebe-bf11-dde794fea739
I found that cockroach sql --url does not accept a URL containing multiple addresses, so I massaged the URL to pick one:
# cockroach sql --url 'postgresql://[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable'
Now, let's take a look at the stuck saga's state:
root@[fd00:1122:3344:108::3]:32221/omicron> \x
root@[fd00:1122:3344:108::3]:32221/omicron> select * from saga where id = '8f9a2acd-d086-455d-9d58-527fb5e61473';
-[ RECORD 1 ]
id | 8f9a2acd-d086-455d-9d58-527fb5e61473
creator | 65a11c18-7f59-41ac-b9e7-680627f996e7
time_created | 2024-06-22 00:49:55.944567+00
name | disk-delete
saga_dag | {"end_node": 13, "graph": {"edge_property": "directed", "edges": [[0, 1, null], [1, 2, null], [2, 3, null], [4, 5, null], [4, 6, null], [6, 7, null], [5, 7, null], [7, 8, null], [8, 9, null], [9, 10, null], [3, 4, null], [10, 11, null], [12, 0, null], [11, 13, null]], "node_holes": [], "nodes": [{"Action": {"action_name": "disk_delete.delete_disk_record", "label": "DeleteDiskRecord", "name": "deleted_disk"}}, {"Action": {"action_name": "disk_delete.space_account", "label": "SpaceAccount", "name": "no_result1"}}, {"Constant": {"name": "params_for_volume_delete_subsaga", "value": {"serialized_authn": {"kind": {"Authenticated": [{"actor": {"SiloUser": {"silo_id": "7bd7623a-68ed-4636-8ecb-b59e3b068787", "silo_user_id": "906e74cb-eab8-4a87-bda8-2cb0914bf853"}}}, {"mapped_fleet_roles": {"admin": ["admin"]}}]}}, "volume_id": "7de86243-3f52-4d60-b2be-a38f009154c8"}}}, {"SubsagaStart": {"params_node_name": "params_for_volume_delete_subsaga", "saga_name": "volume-delete"}}, {"Action": {"action_name": "volume_delete.decrease_crucible_resource_count", "label": "DecreaseCrucibleResourceCount", "name": "crucible_resources_to_delete"}}, {"Action": {"action_name": "volume_delete.delete_crucible_regions", "label": "DeleteCrucibleRegions", "name": "no_result_1"}}, {"Action": {"action_name": "volume_delete.delete_crucible_running_snapshots", "label": "DeleteCrucibleRunningSnapshots", "name": "no_result_2"}}, {"Action": {"action_name": "volume_delete.delete_crucible_snapshots", "label": "DeleteCrucibleSnapshots", "name": "no_result_3"}}, {"Action": {"action_name": "volume_delete.delete_crucible_snapshot_records", "label": "DeleteCrucibleSnapshotRecords", "name": "no_result_4"}}, {"Action": {"action_name": "volume_delete.delete_freed_crucible_regions", "label": "DeleteFreedCrucibleRegions", "name": "no_result_5"}}, {"Action": {"action_name": "volume_delete.hard_delete_volume_record", "label": "HardDeleteVolumeRecord", "name": "final_no_result"}}, {"SubsagaEnd": {"name": 
"volume_delete_subsaga_no_result"}}, {"Start": {"params": {"disk_id": "0c2fde32-4edc-4671-a68c-31d476a4461e", "project_id": "fe0da422-5c48-4b52-8010-f2fc401f090f", "serialized_authn": {"kind": {"Authenticated": [{"actor": {"SiloUser": {"silo_id": "7bd7623a-68ed-4636-8ecb-b59e3b068787", "silo_user_id": "906e74cb-eab8-4a87-bda8-2cb0914bf853"}}}, {"mapped_fleet_roles": {"admin": ["admin"]}}]}}, "volume_id": "7de86243-3f52-4d60-b2be-a38f009154c8"}}}, "End"]}, "saga_name": "disk-delete", "start_node": 12}
saga_state | running
current_sec | 65a11c18-7f59-41ac-b9e7-680627f996e7
adopt_generation | 1
adopt_time | 2024-06-22 00:49:55.944567+00
It's being executed by Nexus 65a11c18-7f59-41ac-b9e7-680627f996e7. And here's its saga log. This is the entire persistent state of the saga (although I left out the node outputs here) and should tell us which steps are done vs. in-progress vs. not started:
root@[fd00:1122:3344:108::3]:32221/omicron> select node_id,event_type,event_time,creator from saga_node_event where saga_id = '8f9a2acd-d086-455d-9d58-527fb5e61473';
node_id | event_type | event_time | creator
----------+------------+-------------------------------+---------------------------------------
0 | started | 2024-06-22 00:49:55.955687+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
0 | succeeded | 2024-06-22 00:49:55.962846+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
1 | started | 2024-06-22 00:49:55.96859+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
1 | succeeded | 2024-06-22 00:49:55.978871+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
2 | started | 2024-06-22 00:49:55.980688+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
2 | succeeded | 2024-06-22 00:49:55.983285+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
3 | started | 2024-06-22 00:49:55.9852+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
3 | succeeded | 2024-06-22 00:49:55.986991+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
4 | started | 2024-06-22 00:49:55.989139+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
4 | succeeded | 2024-06-22 00:49:55.999131+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
5 | started | 2024-06-22 00:49:56.001253+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
5 | succeeded | 2024-06-22 00:49:56.005814+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
6 | started | 2024-06-22 00:49:56.003296+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
6 | succeeded | 2024-06-22 00:49:56.008172+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
7 | started | 2024-06-22 00:49:56.011041+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
7 | succeeded | 2024-06-22 00:49:56.013154+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
8 | started | 2024-06-22 00:49:56.014863+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
8 | succeeded | 2024-06-22 00:49:56.016756+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
9 | started | 2024-06-22 00:49:56.018449+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
12 | started | 2024-06-22 00:49:55.948363+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
12 | succeeded | 2024-06-22 00:49:55.953173+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
(21 rows)
It looks like we're stuck in node 9: it has a "started" record but no "succeeded" (or "failed") record. Per the DAG above, node 9 is volume_delete.delete_freed_crucible_regions.
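That determination can be expressed mechanically: a node is in-flight if the saga log has a "started" event for it but no terminal event. This is a sketch against the saga_node_event rows above, not the real Steno code:

```rust
use std::collections::BTreeSet;

// Sketch (not the real Steno code) of the "which node is stuck" check: a
// node is in-flight when the saga log has a "started" event for it but no
// terminal ("succeeded" or "failed") event.
fn in_flight_nodes(events: &[(u32, &str)]) -> Vec<u32> {
    let started: BTreeSet<u32> = events
        .iter()
        .filter(|(_, kind)| *kind == "started")
        .map(|(node, _)| *node)
        .collect();
    let finished: BTreeSet<u32> = events
        .iter()
        .filter(|(_, kind)| matches!(*kind, "succeeded" | "failed"))
        .map(|(node, _)| *node)
        .collect();
    started.difference(&finished).copied().collect()
}

fn main() {
    // The saga log above: nodes 0 through 8 and node 12 both started and
    // succeeded; node 9 only started.
    let mut events: Vec<(u32, &str)> = Vec::new();
    for node in (0..=8).chain(std::iter::once(12)) {
        events.push((node, "started"));
        events.push((node, "succeeded"));
    }
    events.push((9, "started"));
    assert_eq!(in_flight_nodes(&events), vec![9]);
}
```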
Using the same approach I used to find a CockroachDB node, I found Nexus on sled 2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa, which is BRM44220011. On that box, I used oxlog to find the logs. It was a little tricky to find the right one based on timestamp. I wound up looking at the paths emitted by oxlog, finding the common directories, doing something like ls -lrt /pool/ext/*/crypt/debug/oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7/oxide-nexus* (sorry, I don't have the exact command any more), and finding the one whose mtime was closest to the timestamps above (2024-06-22T00:49:55). That turned out to be /pool/ext/bf428719-1b16-4503-99f4-ad95846d916f/crypt/debug/oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7/oxide-nexus:default.log.1719018901.
To sanity check that I had the right one, I looked for SMF-recorded events, which start with [:
BRM44220011 # grep '^\[ ' /pool/ext/bf428719-1b16-4503-99f4-ad95846d916f/crypt/debug/oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7/oxide-nexus:default.log.1719018901
[ Jun 22 00:46:43 Enabled. ]
[ Jun 22 00:46:45 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/omicron-nexus/bin/nexus /var/svc/manifest/site/nexus/config.toml &"). ]
[ Jun 22 00:46:45 Method "start" exited with status 0. ]
[ Jun 22 00:53:50 Enabled. ]
[ Jun 22 00:53:52 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/omicron-nexus/bin/nexus /var/svc/manifest/site/nexus/config.toml &"). ]
[ Jun 22 00:53:52 Method "start" exited with status 0. ]
We do see two startups there, consistent with Angela's testing.
Nexus log
Now, is there anything in the Nexus log about this saga? Just this:
# grep 8f9a2acd-d086-455d-9d58-527fb5e61473 /pool/ext/bf428719-1b16-4503-99f4-ad95846d916f/crypt/debug/oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7/oxide-nexus:default.log.1719018901 | looker
...
00:49:55.944Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): creating saga
file = nexus/db-queries/src/db/sec_store.rs:49
saga_id = 8f9a2acd-d086-455d-9d58-527fb5e61473
saga_name = disk-delete
...
I was surprised not to see any log messages about saga recovery, successful or otherwise. So I looked at the code and saw that the log messages should have "component": "SagaRecoverer", and looked for that:
# grep SagaRecoverer /pool/ext/bf428719-1b16-4503-99f4-ad95846d916f/crypt/debug/oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7/oxide-nexus:default.log.1719018901 | looker
00:46:47.048Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): start saga recovery
file = nexus/db-queries/src/db/saga_recovery.rs:77
00:46:47.202Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): listed sagas (10 total)
file = nexus/db-queries/src/db/saga_recovery.rs:113
00:46:47.232Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): failed to recover saga 7dce104f-0806-4906-a8b6-267f18379e21: Internal Error: failed to resume saga: recovery for saga 7dce104f-0806-4906-a8b6-267f18379e21: node NodeIndex(0): load status is "Succeeded(String("f7934b34-1e6e-4fa2-8cc5-e2456508d45f"))", which is illegal for parent load status "NeverStarted"
file = nexus/db-queries/src/db/saga_recovery.rs:137
00:53:52.679Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): start saga recovery
file = nexus/db-queries/src/db/saga_recovery.rs:77
00:53:52.754Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): listed sagas (16 total)
file = nexus/db-queries/src/db/saga_recovery.rs:113
00:53:52.832Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): failed to recover saga 7dce104f-0806-4906-a8b6-267f18379e21: Internal Error: failed to resume saga: recovery for saga 7dce104f-0806-4906-a8b6-267f18379e21: node NodeIndex(0): load status is "Succeeded(String("f7934b34-1e6e-4fa2-8cc5-e2456508d45f"))", which is illegal for parent load status "NeverStarted"
file = nexus/db-queries/src/db/saga_recovery.rs:137
I thought we'd have messages for each saga that we recovered, but those are only logged at trace level.
Still, these messages suggest something deeply wrong with recovery of a saga, but it's not the one we were looking at. I made a note to look at that one later and continued to focus on what's wrong with 8f9a2acd-d086-455d-9d58-527fb5e61473. Was the query that lists unfinished sagas somehow not finding it? I wrote SQL that I thought was equivalent to what the code executes:
root@[fd00:1122:3344:108::3]:32221/omicron> select id from saga where saga_state != 'done' AND current_sec IS NOT NULL and current_sec = '65a11c18-7f59-41ac-b9e7-680627f996e7';
id
----------------------------------------
7dce104f-0806-4906-a8b6-267f18379e21
8f9a2acd-d086-455d-9d58-527fb5e61473
ae6e094a-8571-4860-8883-17e59f71bfd4
bf4200c4-9b91-41f3-b4d3-f26e5188b09f
c29b403a-e121-48dd-96d8-e62692268d95
e1d49ee4-e6ba-4fe5-90de-c7cc066920f7
e2278b58-ad02-49a0-a9b8-50d567c2cac5
e305e20f-e7a1-463e-a736-d5c2e1735f41
ebc5ae43-1e67-42c7-b3da-1c52e8800ec4
ecc3e25d-049c-4bcb-85ca-11eb7bf7915a
f0957717-83ba-4502-a96d-cc975c2ef9ee
f20e7592-bfb8-425d-ac38-7f16045e6397
fcc68eee-65a6-4e56-b4d5-898258940a81
fff815a9-608b-44d5-9112-3419febfa16f
(14 rows)
The logs reported 10 and 16 sagas found, and now we've got 14, so that's at least in the ballpark. And the saga in question is present there -- as is the one we saw an error about.
At this point I went back through the source to see what it should be doing. I expected that the code would attempt to recover all sagas, even if one of them failed. Here's what it actually does: for each saga, it calls recover_saga. On failure, it logs a warning (the one we saw in the log). Each of these recoveries happens in its own future (constructed at lines 115-144 of saga_recovery.rs). We run them in order at lines 147-150... but we bail out at line 149 (via the ?) if any of them produces an error. The error returned there winds up terminating this tokio task, but nothing awaits this task's JoinHandle, so nothing else notices or does anything about it.
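In miniature, the failure mode looks like this. The sketch below is a simplified, synchronous illustration, not the actual saga_recovery.rs (the real code builds futures and drives them inside a tokio task); recover_saga here is a stand-in that fails for one saga id, like the 7dce104f saga in the log above.

```rust
// Simplified illustration (not the actual saga_recovery.rs code) of the
// control-flow bug: driving per-saga recovery results with `?` aborts the
// whole loop at the first failure, so later sagas are never attempted.
fn recover_all_bail_early(sagas: &[&str]) -> Result<Vec<String>, String> {
    let mut recovered = Vec::new();
    for saga in sagas {
        // `?` propagates the first error and skips every remaining saga.
        recovered.push(recover_saga(saga)?);
    }
    Ok(recovered)
}

// The fix: record each failure individually and keep going.
fn recover_all_keep_going(sagas: &[&str]) -> Vec<Result<String, String>> {
    sagas.iter().map(|saga| recover_saga(saga)).collect()
}

// Stand-in for the real per-saga recovery step: fails for one saga id,
// mimicking the "failed to recover saga 7dce104f-..." warning above.
fn recover_saga(id: &str) -> Result<String, String> {
    if id.starts_with("7dce104f") {
        Err(format!("failed to recover saga {id}"))
    } else {
        Ok(id.to_string())
    }
}

fn main() {
    // In id order, as the query returns them: the broken saga sorts first.
    let sagas = ["7dce104f", "8f9a2acd", "fff815a9"];
    // Bail-early: errors out immediately; 8f9a2acd is never even attempted.
    assert!(recover_all_bail_early(&sagas).is_err());
    // Keep-going: attempts all three; exactly one fails.
    let results = recover_all_keep_going(&sagas);
    assert_eq!(results.iter().filter(|r| r.is_err()).count(), 1);
}
```

The fix has the shape of recover_all_keep_going: handle each saga's recovery result individually (log it, count it) rather than letting the first error abort the loop.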
So I think this explains the bug: when some other saga failed to be recovered, we skipped even trying to recover the rest, leaving a bunch of sagas hung. We still have the question of why that earlier saga failed to recover; I'll write that up in a new ticket.
It's worth noting that in this case, we would still have successfully recovered any sagas with an id lexicographically earlier than the one that failed. We see evidence of this in the list of unfinished sagas assigned to this Nexus: our broken one is the first one in the list. There may have been earlier ones that we successfully recovered (and that have since finished, so they no longer show up as unfinished).
There's already a TODO about fixing this in omicron/nexus/db-queries/src/db/saga_recovery.rs (lines 116 to 125 at commit 01d8b37).
For other reasons, we already want to make saga recovery a more continual thing that happens in a background task (so that we can notice when we get assigned another Nexus's sagas if that Nexus gets expunged). That will require reworking this code and fixing this should fall out of that.