omicron test failures result in panic while panicking that's hard to debug
davepacheco opened this issue · 2 comments
I've heard and seen some test failures in omicron (our internal consumer of steno) where a test fails, and so panics, but then panics again in steno while panicking. I believe this is related to the SecClient's Drop handler, which panics if it can't send a message to the SEC. It's been a little while since I've looked at one so I don't have more info now. Whatever happens, it hides the real test failure and it's annoying to debug. (More information coming.)
The gory details of the failure in omicron are in oxidecomputer/omicron#279.
tl;dr:
- The Drop handler on
SecClient
should probably just go. - We should reconsider the panic behavior from the SecClient and the SecExecClient, but these are less obviously wrong
The facts
In the omicron failure, the SEC panicked on some to-be-filed steno bug. This caused a bunch of follow-on panics:
- A consumer had created a saga and was using the Future returned by SecClient to wait for the saga to complete. That consumer panicked because the SEC was gone.
- A different part of saga execution (the separate exec_node task) was attempting to record a message for the saga log. It panicked because the SEC was gone.
- The integration test driving all of this panicked (probably correctly), causing it to tear down everything, including the SecClient. That panicked in the Drop handler, which assumes the SEC can't stop.
This last panic happened while panicking, causing an abort without a lot to go on. This mess was somewhat clear from the panic messages and stack traces, but definitely took some wading through it all.
What to do about it
When I wrote the SEC and SEC Client, I thought of them as tightly coupled: if the SEC was going to panic, that's a programmer error, and we're going to want the whole process to come down. It didn't make sense to me to make the SecClient interface a bunch more complicated to express the failure condition that the SEC has failed, given that the only consumer we have right now (Nexus) can't do anything useful but panic anyway. That might indeed work okay for Nexus, but it falls apart in automated tests, which do not support panic = abort
(and where it's arguably undesirable there). As we discussed in #14, it's probably better to think about a separate task with which we're communicating over a channel as a separate fault domain.
Given this, I think the Drop
handler on SecClient
is wrong -- it shouldn't panic if it can't send a message to the SEC. This was here because the condition was supposed to be impossible by design so it might represent a real bug (e.g., if the SEC had accidentally dropped some channel handle it was supposed to hold onto). But we really can't prevent this condition.
There's a follow-on question regarding the other channels used to communicate with the SEC:
- SecClient (used by Steno consumers to interact with the SEC), outside the "drop" case. We could continue to panic when we try to do something with a dead SEC. We could also try to propagate this as a first-class error. If we do that, we need to do one of a few things:
- Roll it into our
anyhow::Error
and assume callers will treat it like any other error. This probably isn't good because if the source of this error was an external API request, then an SEC panic is probably a 500 error, but most of the existing errors are probably 400 errors. - Use a structured error (like
thiserror
) rather than anyhow::Error. We'll probably want to make sure it's easy for consumers to identify the SecPanicked case and panic themselves, as that's what Nexus may wind up doing in most cases.
- Roll it into our
- SecExecClient (used internally by Steno SagaExecs to communicate with the SEC). It might be fine to continue panicking in this case. (Longer-term, this mechanism is ugly anyway: we probably want to have these not be separate tasks, but maybe different Futures being await'd on from the top of the SEC.)
Fixed via #18.