saga recovery stops if Nexus fails to recover a saga
Closed this issue · 1 comment
@askfongjojo was testing how the system behaves after a Nexus failure and reported:
I notice that some sagas aren't resumed when nexus came back after a period of unavailability (the way I tested this was by halting a nexus zone and then booting it up after 20-30s - all of these happened while running terraform destroy on a group of instances). Here is an example of a "left behind" saga on rack2 - id: 8f9a2acd-d086-455d-9d58-527fb5e61473. It stays in "running" state indefinitely.
So the symptom is that a saga is hung after a Nexus restart.
Initial questions I had:
- Which Nexus is responsible for running this saga?
- Did it successfully recover the saga (and start running it again)? (If so, it could be a bug in the specific saga itself. But if not, there's something wrong with saga recovery.)
Finding Nexus
I'm including this for completeness but this section doesn't cover anything new about this bug -- just how I got to the right Nexus log.
I didn't see any commands for looking at sagas in omdb (this is #4157 / #4378) so I figured I'd query the database directly. To do that I needed to find the CockroachDB nodes and the PostgreSQL URL. The PostgreSQL URL is spit out by any successful omdb db invocation, like:
# omdb db sleds
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
...
To find the CockroachDB nodes, I did:
# omdb nexus blueprints show current | grep cockroach
cockroach_db 3237a532-acaa-4ebe-bf11-dde794fea739 in service fd00:1122:3344:109::3
cockroach_db 8bbea076-ff60-4330-8302-383e18140ef3 in service fd00:1122:3344:10b::3
cockroach_db e86845b5-eabd-49f5-9a10-6dfef9066209 in service fd00:1122:3344:108::3
cockroach_db a3628a56-6f85-43b5-be50-71d8f0e04877 in service fd00:1122:3344:107::3
cockroach_db 4c3ef132-ec83-4b1b-9574-7c7d3035f9e9 in service fd00:1122:3344:105::3
I matched up the IP prefixes with the sleds' IP prefixes, which are printed by omdb db sleds:
# omdb db sleds
note: database URL not specified. Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (77.0.0)
note: listing all commissioned sleds (use -F to filter, e.g. -F in-service)
SERIAL IP ROLE POLICY STATE ID
BRM42220057 [fd00:1122:3344:104::1]:12345 - in service active 0c7011f7-a4bf-4daf-90cc-1c2410103300
BRM23230018 [fd00:1122:3344:124::1]:12345 - not provisionable active 1efda86b-caef-489f-9792-589d7677e59a
BRM44220011 [fd00:1122:3344:103::1]:12345 - in service active 2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa
BRM42220017 [fd00:1122:3344:109::1]:12345 - in service active 5f6720b8-8a31-45f8-8c94-8e699218f28b
BRM42220051 [fd00:1122:3344:10b::1]:12345 scrimlet in service active 71def415-55ad-46b4-ba88-3ca55d7fb287
BRM44220010 [fd00:1122:3344:101::1]:12345 - in service active 7b862eb6-7f50-4c2f-b9a6-0d12ac913d3c
BRM42220014 [fd00:1122:3344:108::1]:12345 scrimlet in service active 87c2c4fc-b0c7-4fef-a305-78f0ed265bbc
BRM42220031 [fd00:1122:3344:102::1]:12345 - in service active a2adea92-b56e-44fc-8a0d-7d63b5fd3b93
BRM42220006 [fd00:1122:3344:106::1]:12345 - in service active b886b58a-1e3f-4be1-b9f2-0c2e66c6bc88
BRM27230045 [fd00:1122:3344:123::1]:12345 - not provisionable active bd96ef7c-4941-4729-b6f7-5f47feecbc4b
BRM42220009 [fd00:1122:3344:107::1]:12345 - in service active db183874-65b5-4263-a1c1-ddb2737ae0e9
BRM42220016 [fd00:1122:3344:10a::1]:12345 - in service active dd83e75a-1edf-4aa1-89a0-cd8b2091a7cd
BRM44220005 [fd00:1122:3344:105::1]:12345 - in service active f15774c1-b8e5-434f-a493-ec43f96cba06
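The prefix matching I did by hand can be sketched in code. This is a hypothetical helper, not anything in omicron: it pairs a zone's underlay IPv6 address with a sled by comparing the first four hextets as strings, which works here because the omdb output writes that prefix uncompressed.

```rust
// Hypothetical sketch of the by-hand matching above: a zone address like
// fd00:1122:3344:109::3 belongs to the sled whose underlay address shares
// the fd00:1122:3344:109 prefix. This compares the first four hextets as
// strings and assumes the prefix is written uncompressed, as omdb prints it.
fn subnet_prefix(addr: &str) -> String {
    addr.split(':').take(4).collect::<Vec<_>>().join(":")
}

fn find_sled<'a>(zone_ip: &str, sleds: &[(&'a str, &str)]) -> Option<&'a str> {
    sleds
        .iter()
        .find(|(_, sled_ip)| subnet_prefix(sled_ip) == subnet_prefix(zone_ip))
        .map(|(serial, _)| *serial)
}

fn main() {
    // A few (serial, underlay IP) pairs from the omdb db sleds output above.
    let sleds = [
        ("BRM42220057", "fd00:1122:3344:104::1"),
        ("BRM42220017", "fd00:1122:3344:109::1"),
        ("BRM42220014", "fd00:1122:3344:108::1"),
    ];
    // The first CockroachDB zone (fd00:1122:3344:109::3) lands on BRM42220017.
    assert_eq!(find_sled("fd00:1122:3344:109::3", &sleds), Some("BRM42220017"));
}
```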
I picked the first CockroachDB zone, saw that its IP prefix matched BRM42220017's, and logged in:
pilot host login BRM42220017
zlogin oxz_cockroachdb_3237a532-acaa-4ebe-bf11-dde794fea739
I found that cockroach sql --url does not accept a URL containing multiple addresses, so I massaged the URL to pick one:
# cockroach sql --url 'postgresql://[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable'
Now, let's take a look at the stuck saga's state:
root@[fd00:1122:3344:108::3]:32221/omicron> \x
root@[fd00:1122:3344:108::3]:32221/omicron> select * from saga where id = '8f9a2acd-d086-455d-9d58-527fb5e61473';
-[ RECORD 1 ]
id | 8f9a2acd-d086-455d-9d58-527fb5e61473
creator | 65a11c18-7f59-41ac-b9e7-680627f996e7
time_created | 2024-06-22 00:49:55.944567+00
name | disk-delete
saga_dag | {"end_node": 13, "graph": {"edge_property": "directed", "edges": [[0, 1, null], [1, 2, null], [2, 3, null], [4, 5, null], [4, 6, null], [6, 7, null], [5, 7, null], [7, 8, null], [8, 9, null], [9, 10, null], [3, 4, null], [10, 11, null], [12, 0, null], [11, 13, null]], "node_holes": [], "nodes": [{"Action": {"action_name": "disk_delete.delete_disk_record", "label": "DeleteDiskRecord", "name": "deleted_disk"}}, {"Action": {"action_name": "disk_delete.space_account", "label": "SpaceAccount", "name": "no_result1"}}, {"Constant": {"name": "params_for_volume_delete_subsaga", "value": {"serialized_authn": {"kind": {"Authenticated": [{"actor": {"SiloUser": {"silo_id": "7bd7623a-68ed-4636-8ecb-b59e3b068787", "silo_user_id": "906e74cb-eab8-4a87-bda8-2cb0914bf853"}}}, {"mapped_fleet_roles": {"admin": ["admin"]}}]}}, "volume_id": "7de86243-3f52-4d60-b2be-a38f009154c8"}}}, {"SubsagaStart": {"params_node_name": "params_for_volume_delete_subsaga", "saga_name": "volume-delete"}}, {"Action": {"action_name": "volume_delete.decrease_crucible_resource_count", "label": "DecreaseCrucibleResourceCount", "name": "crucible_resources_to_delete"}}, {"Action": {"action_name": "volume_delete.delete_crucible_regions", "label": "DeleteCrucibleRegions", "name": "no_result_1"}}, {"Action": {"action_name": "volume_delete.delete_crucible_running_snapshots", "label": "DeleteCrucibleRunningSnapshots", "name": "no_result_2"}}, {"Action": {"action_name": "volume_delete.delete_crucible_snapshots", "label": "DeleteCrucibleSnapshots", "name": "no_result_3"}}, {"Action": {"action_name": "volume_delete.delete_crucible_snapshot_records", "label": "DeleteCrucibleSnapshotRecords", "name": "no_result_4"}}, {"Action": {"action_name": "volume_delete.delete_freed_crucible_regions", "label": "DeleteFreedCrucibleRegions", "name": "no_result_5"}}, {"Action": {"action_name": "volume_delete.hard_delete_volume_record", "label": "HardDeleteVolumeRecord", "name": "final_no_result"}}, {"SubsagaEnd": {"name": 
"volume_delete_subsaga_no_result"}}, {"Start": {"params": {"disk_id": "0c2fde32-4edc-4671-a68c-31d476a4461e", "project_id": "fe0da422-5c48-4b52-8010-f2fc401f090f", "serialized_authn": {"kind": {"Authenticated": [{"actor": {"SiloUser": {"silo_id": "7bd7623a-68ed-4636-8ecb-b59e3b068787", "silo_user_id": "906e74cb-eab8-4a87-bda8-2cb0914bf853"}}}, {"mapped_fleet_roles": {"admin": ["admin"]}}]}}, "volume_id": "7de86243-3f52-4d60-b2be-a38f009154c8"}}}, "End"]}, "saga_name": "disk-delete", "start_node": 12}
saga_state | running
current_sec | 65a11c18-7f59-41ac-b9e7-680627f996e7
adopt_generation | 1
adopt_time | 2024-06-22 00:49:55.944567+00
It's being executed by Nexus 65a11c18-7f59-41ac-b9e7-680627f996e7. And here's its saga log. This is the entire persistent state of the saga (although I left out the node outputs here) and should tell us which steps are done vs. in-progress vs. not started:
root@[fd00:1122:3344:108::3]:32221/omicron> select node_id,event_type,event_time,creator from saga_node_event where saga_id = '8f9a2acd-d086-455d-9d58-527fb5e61473';
node_id | event_type | event_time | creator
----------+------------+-------------------------------+---------------------------------------
0 | started | 2024-06-22 00:49:55.955687+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
0 | succeeded | 2024-06-22 00:49:55.962846+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
1 | started | 2024-06-22 00:49:55.96859+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
1 | succeeded | 2024-06-22 00:49:55.978871+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
2 | started | 2024-06-22 00:49:55.980688+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
2 | succeeded | 2024-06-22 00:49:55.983285+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
3 | started | 2024-06-22 00:49:55.9852+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
3 | succeeded | 2024-06-22 00:49:55.986991+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
4 | started | 2024-06-22 00:49:55.989139+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
4 | succeeded | 2024-06-22 00:49:55.999131+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
5 | started | 2024-06-22 00:49:56.001253+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
5 | succeeded | 2024-06-22 00:49:56.005814+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
6 | started | 2024-06-22 00:49:56.003296+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
6 | succeeded | 2024-06-22 00:49:56.008172+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
7 | started | 2024-06-22 00:49:56.011041+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
7 | succeeded | 2024-06-22 00:49:56.013154+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
8 | started | 2024-06-22 00:49:56.014863+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
8 | succeeded | 2024-06-22 00:49:56.016756+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
9 | started | 2024-06-22 00:49:56.018449+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
12 | started | 2024-06-22 00:49:55.948363+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
12 | succeeded | 2024-06-22 00:49:55.953173+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
(21 rows)
It looks like we're stuck in node 9: it has a "started" record but no "succeeded" (or "failed") record. Per the DAG above, node 9 is volume_delete.delete_freed_crucible_regions.
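That determination can be expressed mechanically: a node is in-flight if the saga log has a "started" event for it but no terminal event. This is a sketch against the saga_node_event rows above, not the real Steno code:

```rust
use std::collections::BTreeSet;

// Sketch (not the real Steno code) of the "which node is stuck" check: a
// node is in-flight when the saga log has a "started" event for it but no
// terminal ("succeeded" or "failed") event.
fn in_flight_nodes(events: &[(u32, &str)]) -> Vec<u32> {
    let started: BTreeSet<u32> = events
        .iter()
        .filter(|(_, kind)| *kind == "started")
        .map(|(node, _)| *node)
        .collect();
    let finished: BTreeSet<u32> = events
        .iter()
        .filter(|(_, kind)| matches!(*kind, "succeeded" | "failed"))
        .map(|(node, _)| *node)
        .collect();
    started.difference(&finished).copied().collect()
}

fn main() {
    // The saga log above: nodes 0 through 8 and node 12 both started and
    // succeeded; node 9 only started.
    let mut events: Vec<(u32, &str)> = Vec::new();
    for node in (0..=8).chain(std::iter::once(12)) {
        events.push((node, "started"));
        events.push((node, "succeeded"));
    }
    events.push((9, "started"));
    assert_eq!(in_flight_nodes(&events), vec![9]);
}
```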
Using the same approach I used to find a CockroachDB node, I found Nexus on sled 2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa, which is BRM44220011. On that box, I used oxlog to find the logs. It was a little tricky to find the right one based on timestamp. I wound up looking at the paths emitted by oxlog, finding the common directories, doing something like ls -lrt /pool/ext/*/crypt/debug/oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7/oxide-nexus* (sorry, I don't have the exact command any more), and finding the one whose mtime was closest to the timestamps above (2024-06-22T00:49:55). That turned out to be /pool/ext/bf428719-1b16-4503-99f4-ad95846d916f/crypt/debug/oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7/oxide-nexus:default.log.1719018901.
To sanity check that I had the right one, I looked for SMF-recorded events, which start with [:
BRM44220011 # grep '^\[ ' /pool/ext/bf428719-1b16-4503-99f4-ad95846d916f/crypt/debug/oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7/oxide-nexus:default.log.1719018901
[ Jun 22 00:46:43 Enabled. ]
[ Jun 22 00:46:45 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/omicron-nexus/bin/nexus /var/svc/manifest/site/nexus/config.toml &"). ]
[ Jun 22 00:46:45 Method "start" exited with status 0. ]
[ Jun 22 00:53:50 Enabled. ]
[ Jun 22 00:53:52 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/omicron-nexus/bin/nexus /var/svc/manifest/site/nexus/config.toml &"). ]
[ Jun 22 00:53:52 Method "start" exited with status 0. ]
We do see two startups there, consistent with Angela's testing.
Nexus log
Now, is there anything in the Nexus log about this saga? Just this:
# grep 8f9a2acd-d086-455d-9d58-527fb5e61473 /pool/ext/bf428719-1b16-4503-99f4-ad95846d916f/crypt/debug/oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7/oxide-nexus:default.log.1719018901 | looker
...
00:49:55.944Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): creating saga
file = nexus/db-queries/src/db/sec_store.rs:49
saga_id = 8f9a2acd-d086-455d-9d58-527fb5e61473
saga_name = disk-delete
...
I was surprised not to see any log messages about saga recovery, successful or otherwise. So I looked at the code and saw that the log messages should have "component": "SagaRecoverer", and looked for that:
# grep SagaRecoverer /pool/ext/bf428719-1b16-4503-99f4-ad95846d916f/crypt/debug/oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7/oxide-nexus:default.log.1719018901 | looker
00:46:47.048Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): start saga recovery
file = nexus/db-queries/src/db/saga_recovery.rs:77
00:46:47.202Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): listed sagas (10 total)
file = nexus/db-queries/src/db/saga_recovery.rs:113
00:46:47.232Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): failed to recover saga 7dce104f-0806-4906-a8b6-267f18379e21: Internal Error: failed to resume saga: recovery for saga 7dce104f-0806-4906-a8b6-267f18379e21: node NodeIndex(0): load status is "Succeeded(String("f7934b34-1e6e-4fa2-8cc5-e2456508d45f"))", which is illegal for parent load status "NeverStarted"
file = nexus/db-queries/src/db/saga_recovery.rs:137
00:53:52.679Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): start saga recovery
file = nexus/db-queries/src/db/saga_recovery.rs:77
00:53:52.754Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): listed sagas (16 total)
file = nexus/db-queries/src/db/saga_recovery.rs:113
00:53:52.832Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): failed to recover saga 7dce104f-0806-4906-a8b6-267f18379e21: Internal Error: failed to resume saga: recovery for saga 7dce104f-0806-4906-a8b6-267f18379e21: node NodeIndex(0): load status is "Succeeded(String("f7934b34-1e6e-4fa2-8cc5-e2456508d45f"))", which is illegal for parent load status "NeverStarted"
file = nexus/db-queries/src/db/saga_recovery.rs:137
I thought we'd have messages for each saga that we recovered, but those are only logged at trace level.
Still, these messages suggest something deeply wrong with recovery of a saga, but it's not the one we were looking at. I made a note to look at that one later and continued to focus on what's wrong with 8f9a2acd-d086-455d-9d58-527fb5e61473. Was the query that lists unfinished sagas somehow not finding it? I wrote SQL that I thought was equivalent to what the code executes:
root@[fd00:1122:3344:108::3]:32221/omicron> select id from saga where saga_state != 'done' AND current_sec IS NOT NULL and current_sec = '65a11c18-7f59-41ac-b9e7-680627f996e7';
id
----------------------------------------
7dce104f-0806-4906-a8b6-267f18379e21
8f9a2acd-d086-455d-9d58-527fb5e61473
ae6e094a-8571-4860-8883-17e59f71bfd4
bf4200c4-9b91-41f3-b4d3-f26e5188b09f
c29b403a-e121-48dd-96d8-e62692268d95
e1d49ee4-e6ba-4fe5-90de-c7cc066920f7
e2278b58-ad02-49a0-a9b8-50d567c2cac5
e305e20f-e7a1-463e-a736-d5c2e1735f41
ebc5ae43-1e67-42c7-b3da-1c52e8800ec4
ecc3e25d-049c-4bcb-85ca-11eb7bf7915a
f0957717-83ba-4502-a96d-cc975c2ef9ee
f20e7592-bfb8-425d-ac38-7f16045e6397
fcc68eee-65a6-4e56-b4d5-898258940a81
fff815a9-608b-44d5-9112-3419febfa16f
(14 rows)
The logs reported 10 and 16 sagas found, and now we've got 14, so that's at least in the ballpark. And the saga in question is present there -- as is the one we saw an error about.
At this point I went back through the source to see what it should be doing. I expected that the code would attempt to recover all sagas, even if one of them failed. Here's what it actually does: for each saga, it calls recover_saga. On failure, it logs a warning (the one we saw in the log). Each of these recoveries happens in its own future (constructed at lines 115-144 of saga_recovery.rs). We run them in order at lines 147-150... but we bail out at line 149 (via the ?) if any of them produces an error. The error returned there winds up terminating this tokio task, but nothing awaits this task's JoinHandle, so nothing else notices or does anything about it.
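In miniature, the failure mode looks like this. The sketch below is a simplified, synchronous illustration, not the actual saga_recovery.rs (the real code builds futures and drives them inside a tokio task); recover_saga here is a stand-in that fails for one saga id, like the 7dce104f saga in the log above.

```rust
// Simplified illustration (not the actual saga_recovery.rs code) of the
// control-flow bug: driving per-saga recovery results with `?` aborts the
// whole loop at the first failure, so later sagas are never attempted.
fn recover_all_bail_early(sagas: &[&str]) -> Result<Vec<String>, String> {
    let mut recovered = Vec::new();
    for saga in sagas {
        // `?` propagates the first error and skips every remaining saga.
        recovered.push(recover_saga(saga)?);
    }
    Ok(recovered)
}

// The fix: record each failure individually and keep going.
fn recover_all_keep_going(sagas: &[&str]) -> Vec<Result<String, String>> {
    sagas.iter().map(|saga| recover_saga(saga)).collect()
}

// Stand-in for the real per-saga recovery step: fails for one saga id,
// mimicking the "failed to recover saga 7dce104f-..." warning above.
fn recover_saga(id: &str) -> Result<String, String> {
    if id.starts_with("7dce104f") {
        Err(format!("failed to recover saga {id}"))
    } else {
        Ok(id.to_string())
    }
}

fn main() {
    // In id order, as the query returns them: the broken saga sorts first.
    let sagas = ["7dce104f", "8f9a2acd", "fff815a9"];
    // Bail-early: errors out immediately; 8f9a2acd is never even attempted.
    assert!(recover_all_bail_early(&sagas).is_err());
    // Keep-going: attempts all three; exactly one fails.
    let results = recover_all_keep_going(&sagas);
    assert_eq!(results.iter().filter(|r| r.is_err()).count(), 1);
}
```

The fix has the shape of recover_all_keep_going: handle each saga's recovery result individually (log it, count it) rather than letting the first error abort the loop.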
So I think this explains the bug: when some other saga failed to be recovered, we skipped even trying to recover the rest, leaving a bunch of sagas hung. We still have the question of why that earlier saga failed to recover; I'll write that up in a new ticket.
It's worth noting that in this case, we would still have successfully recovered any sagas with an id lexicographically earlier than the one that failed. We see evidence of this in the list of unfinished sagas assigned to this Nexus: our broken one is the first one in the list. There may have been earlier ones that we successfully recovered (and that have since finished, so they no longer show up as unfinished).
There's already a TODO about fixing this in omicron/nexus/db-queries/src/db/saga_recovery.rs (lines 116 to 125 at commit 01d8b37).
For other reasons, we already want to make saga recovery a more continual thing that happens in a background task (so that we can notice when we get assigned another Nexus's sagas if that Nexus gets expunged). That will require reworking this code and fixing this should fall out of that.