oxidecomputer/steno

figure out how actions handle invariant violations

davepacheco opened this issue · 1 comments

Moved from #26, where @bnaecker wrote:

It's also not clear how sagas handle invariants that they would like to assert. This would normally just abort/unwind the program, according to the disposition it was built with. One could imagine catching these and having some policy around retrying the operations, potentially up to some count, specified at creation time. It'll take some care to make sure we don't block multiple sagas, or worse, prevent those later sagas from ever running to completion if an earlier one fails.

This may be more of an Omicron concern.

A large class of things you might want to assert as invariants are actually reasonable operational errors. For example, if a saga action looks at data from a previous node, it has to assume that the data came from the saga log, which probably went through a database or some other source outside the Rust program. It's always possible that the data has been modified or corrupted, so it's not a programmer error if it differs from what we expect. (In an ideal world, any such problems would be identified when the type is deserialized.)
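To make that distinction concrete, here's a minimal sketch of an action treating unexpected saga-log data as an operational error rather than asserting on it. Everything here is hypothetical and for illustration only: `InstanceParams`, `ActionError`, `load_instance_params`, and the `ncpus=<n>` encoding are not part of Steno's API.

```rust
/// Hypothetical output of an earlier saga node, as read back from the saga log.
#[derive(Debug, PartialEq)]
struct InstanceParams {
    ncpus: u32,
}

/// Hypothetical operational error returned instead of panicking on bad data.
#[derive(Debug, PartialEq)]
enum ActionError {
    BadLogData(String),
}

/// Parse a value of the (made-up) form "ncpus=<n>" loaded from the saga log.
/// The data came from outside the program, so every check that might
/// otherwise be an assert!() is reported as an error instead.
fn load_instance_params(raw: &str) -> Result<InstanceParams, ActionError> {
    let value = raw.strip_prefix("ncpus=").ok_or_else(|| {
        ActionError::BadLogData(format!("missing ncpus field: {:?}", raw))
    })?;
    let ncpus: u32 = value.parse().map_err(|e| {
        ActionError::BadLogData(format!("bad ncpus {:?}: {}", value, e))
    })?;
    if ncpus == 0 {
        return Err(ActionError::BadLogData("ncpus must be nonzero".to_string()));
    }
    Ok(InstanceParams { ncpus })
}
```

With this shape, a corrupted record fails the action through the normal error path (and whatever unwinding the saga does for failed actions) instead of taking down the process.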

Setting those aside, that leaves real programmer-error-type invariants, which I'll consider synonymous with panics here. There are a bunch of ways to handle these. It's up to the consumer (for us, that's Nexus) to choose an approach, though we may decide to add stuff to Steno to facilitate some of these options. Some ideas:

  1. Say that it's always wrong to panic in an Action. I don't love this. (As a side note: it doesn't deal with the case of steno panicking.)
  2. Allow panics and handle them explicitly such that they don't cause the process to exit, maybe by putting the saga into a terminal "needs support" state. This assumes that the blast radius of the invariant violation is limited to the saga itself. That might be okay, but I find it scary. In my experience outside of Rust, that's not generally true, and making that assumption can turn an otherwise recoverable transient failure into a cascading failure that requires operator intervention.
  3. Allow panics, and take no special action, allowing the consumer to crash. The risk of this is that most consumers will recover the saga and resume where it was, so there's a decent chance the panic happens again and we wind up in a crash loop.
  4. Allow panics, take no special action, allow the consumer to crash, but attempt to detect when this is happening to contain the damage. For example, every time we recover a saga, we can bump an "attempt" count. On recovery, if we find a saga with an attempt count that's too high, we give up and put it directly into a "NeedsSupport" state without running anything from it. The risk is that each crash along the way also takes out any sagas running concurrently with the one that's crashing. That's potentially an okay middle ground, and maybe more sophisticated strategies can improve this.
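
As a rough sketch of what options 2 and 4 might look like in a consumer: all of the names below (`SagaRecord`, `RecoveryDecision`, `MAX_ATTEMPTS`, `recover`, `run_contained`) are hypothetical and not part of Steno. Only `std::panic::catch_unwind` is a real standard-library call, and it only catches panics when the program is built with `panic = "unwind"`; with `panic = "abort"` the process exits regardless.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical cap on recovery attempts before giving up on a saga.
const MAX_ATTEMPTS: u32 = 3;

#[derive(Debug, PartialEq)]
enum RecoveryDecision {
    Resume,
    NeedsSupport,
}

/// Hypothetical persisted record for a saga.
#[derive(Debug)]
struct SagaRecord {
    id: u64,
    attempts: u32,
}

/// Option 4: bump the attempt count on every recovery and refuse to run
/// a saga that has been recovered too many times. A real consumer would
/// need to persist the new count *before* resuming, or a crash would
/// lose the increment.
fn recover(record: &mut SagaRecord) -> RecoveryDecision {
    record.attempts += 1;
    if record.attempts > MAX_ATTEMPTS {
        RecoveryDecision::NeedsSupport
    } else {
        RecoveryDecision::Resume
    }
}

/// Option 2: run an action and turn a panic into a terminal state
/// instead of letting it unwind out of the executor. AssertUnwindSafe is
/// exactly the "blast radius is limited to this saga" assumption: the
/// compiler is told to trust that no shared state was left broken.
fn run_contained<F: FnOnce() -> u32>(action: F) -> Result<u32, RecoveryDecision> {
    panic::catch_unwind(AssertUnwindSafe(action))
        .map_err(|_| RecoveryDecision::NeedsSupport)
}
```

For example, starting from `SagaRecord { id: 1, attempts: 0 }`, the first three calls to `recover` return `Resume` and the fourth returns `NeedsSupport`. Note that the sketch says nothing about the concurrent sagas that get taken out by each crash along the way.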