oxidecomputer/omicron

sled-agent probably needs to set failmode on zpools

A lengthy debugging session on some unrelated bugs revealed the following essential behaviour with respect to an ordinary mirrored pool created by hand on a mirrored pair of SSDs:

19:26 wesolows | So to summarise: in the process of exporting the pool, which ::spa thinks completed just
               | dandy, we encountered a checksum error on a read.  There is no conceivable way that
               | retrying or waiting for anything is ever going to make that succeed, but nevertheless we're
               | sitting here waiting on some oracular operator to restore the proper contents of that
               | block, which is not readily identifiable nor are the contents present anywhere else we'd
               | have done this automatically already, so we come to rest with a mutex held.
19:30 wesolows | The punch line is that the checksum errors appeared spontaneously during a previous boot in
               | which all I was doing was importing and exporting the pool alternating between each of the
               | mirrored vdevs being unavailable.  After that wedged, I rebooted and now this.
 19:30      jmc | so we could at least confirm that, say, setting the mode to continue might be better
 19:31      jmc |
 19:31          |        failmode=wait|continue|panic
 19:31          |                Controls the system behavior in the event of catastrophic pool
 19:31          |                failure.  This condition is typically a result of a loss of
 19:31          |                connectivity to the underlying storage device(s) or a failure of
 19:31          |                all devices within the pool.  The behavior of such an event is
 19:31          |                determined as follows:
 19:31          |
 19:31          |                wait      Blocks all I/O access until the device connectivity is
 19:31          |                          recovered and the errors are cleared.  This is the
 19:31          |                          default behavior.
 19:31          |
 19:31          |                continue  Returns EIO to any new write I/O requests but allows
 19:31          |                          reads to any of the remaining healthy devices.  Any
 19:31          |                          write requests that have yet to be committed to disk
 19:31          |                          would be blocked.
 19:31          |
 19:31          |                panic     Prints out a message to the console and generates a
 19:31          |                          system crash dump.

In this particular case, it might at least sometimes be possible for such recovery to succeed, though here it may have failed because we're in the export path. Regardless, the pools Crucible uses, and presumably any others sled-agent is going to create, have only a single vdev. They will never, ever be able to recover from any kind of I/O failure by waiting for something to happen: if retries didn't work, we're done. Therefore, we should consider setting failmode to either continue or panic.
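
For concreteness, a minimal sketch of what setting the property could look like if sled-agent simply shelled out to zpool(8) when it creates or adopts one of these pools; the function name and error handling are illustrative rather than omicron's existing zpool wrapper, and the choice between continue and panic is left to the caller:

```rust
use std::process::Command;

/// Illustrative sketch: set the failmode property on a sled-agent-managed
/// pool by invoking zpool(8). `mode` would be "continue" or "panic" per the
/// discussion above.
fn set_failmode(pool: &str, mode: &str) -> Result<(), String> {
    let output = Command::new("/usr/sbin/zpool")
        .arg("set")
        .arg(format!("failmode={}", mode))
        .arg(pool)
        .output()
        .map_err(|e| format!("failed to exec zpool: {}", e))?;
    if output.status.success() {
        Ok(())
    } else {
        Err(format!(
            "zpool set failmode={} {} failed: {}",
            mode,
            pool,
            String::from_utf8_lossy(&output.stderr)
        ))
    }
}
```

The same property can also be set at creation time (zpool create -o failmode=...), which avoids a window in which a newly created pool still has the default wait behaviour.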

There is some possible complexity here: the intended use case for failmode is not at all how we got into the case under investigation, in which the SSDs are present and electrically functional. One thing we don't want is a situation in which the single vdev backing a pool has metadata corruption and we end up in a panic loop. Ideally, this kind of condition would result in a faulted pool, with that diagnosis persisted across a reboot so that sled-agent knows recovery should not be attempted (i.e., it needs to reinitialise the underlying device and have Crucible recover/reconstruct). In the absence of that, I'm unsure whether continue is preferable to panic. We will also want to consider this in light of actual loss of connectivity to the underlying device during surprise hotplug.
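
As a rough sketch of the sort of gate described above, one could imagine sled-agent consulting the pool's health after import and recording a FAULTED result in its own persistent state rather than retrying on subsequent boots. Everything here (the function name, shelling out to zpool(8), and using the reported health as the signal) is an assumption for illustration, not existing omicron behaviour:

```rust
use std::process::Command;

/// Illustrative sketch: report whether an imported pool is FAULTED, so that
/// a caller could persist that diagnosis and skip further recovery attempts.
/// The persistence mechanism itself is deliberately not specified here.
fn pool_is_faulted(pool: &str) -> Result<bool, String> {
    let output = Command::new("/usr/sbin/zpool")
        .args(["list", "-H", "-o", "health", pool])
        .output()
        .map_err(|e| format!("failed to exec zpool: {}", e))?;
    if !output.status.success() {
        return Err(String::from_utf8_lossy(&output.stderr).into_owned());
    }
    Ok(String::from_utf8_lossy(&output.stdout).trim() == "FAULTED")
}
```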

This is only for pools managed by sled-agent; the ramdisk-backed root pool is unrelated and has different considerations associated with it.

cc: @leftwo @ahl @jclulow @askfongjojo