QubesOS/qubes-issues

Storage pool exceptions handled ungracefully

cfcs opened this issue · 2 comments

cfcs commented

Qubes OS version:

4.0 (according to /etc/fedora-release)

Affected component(s) or functionality:

qubes-core-admin -> qubesd -> storage pool initialization


Steps to reproduce the behavior:

  1. Write a storage driver, install your fork of qubes-core-admin
  2. Catch exception 'during import (otherwise the result is a complete shit show), re-raise wrapped in qubes.storage.StoragePoolException("unable to load")
  3. Observe that qubes/storage/__init__.py doesn't catch the exception on the line that says pool = self.vm.app.get_pool(volume_config['pool'])
  4. sudo journalctl -fu qubesd spams "failed to start" messages.

I tentatively "fixed" it with the patch below, which turned out to not be a good idea (see notes in section below)

try:
    pool = self.vm.app.get_pool(volume_config['pool'])
except qubes.exc.QubesException as e:
    self.vm.log.error('unable to load pool {!r}'.format(e))
    return None

Expected or desired behavior:

Storage pool drivers should be allowed to fail (temporarily, ideally) without bringing down the whole system, and without the affected VMs/pools being erased from qubes.xml (I took a backup of my qubes.xml before trying to fix this, which turned out to be a good idea since my "fix" resulted in that).

Both when loading the storage pool module itself (which I believe this bug report is about), and during init_volume / init_pool.

The affected pools / vms should still be visible, even if they are not startable, or at the very least they should reappear when the transient error is corrected.

Actual behavior:

qubesd fails to start, as a consequence no VMs are able to start and the qvm-* commands fail.

This happened when updating to a new kernel + new kernel-devel package which included "support" for a GCC plugin to collect extra entropy (somewhere in include/linux/random.h), guarded by an ifdef. For some reason the ifdef-guard resulted in the new fancy code being expanded by DKMS, but a symbol (latent_entropy) was missing, causing DKMS compilation of the ZFS module to fail. That's a separate issue, but I suspect there will be more build failures in the future, so I'd like to have a solution to this problem.

My failed attempts to fix this resulted in the pools being "forgotten," I think because I returned None where qubesd expects an updated copy of the dict that tracks storage pool parameters, causing that to be serialized in place of the original date.

I'm not sure if this was caused by qubes/storage/__init__.py or somewhere else (after my patch), but that line definitely failed. I'm not super inclined to repeat the experiment if the issue can be resolved without further information, since it takes some time to conduct safely, but I can do that if there's not other path forward.

General notes:

I'm not sure if I'm doing it wrong / throwing the wrong exception from the wrong place, but I would appreciate some help understanding how to do this gracefully.


I have consulted the following relevant documentation:

I did not :-(

I am aware of the following related, non-duplicate issues:

The patchset which lead to this (ZFS storage pools):
QubesOS/qubes-core-admin#289

1. install your fork of `qubes-core-admin`

This is unnecessary. This is designed using entry points specifically to allow shipping storage drivers in other packages. No need to fork anything. See also https://dev.qubes-os.org/projects/core-admin/en/latest/qubes-storage.html#storage-pool-driver-api

Storage pool drivers should be allowed to fail (temporarily, ideally)

When accessing the storage? Yes. But when importing/initializing the object? I'm not so sure.
I'd say raising an exception when loading the storage driver is a bug in that storage driver.

Anyway, indeed the failure mode here could be improved.
The solution would need some more changes than you propose, because many places assume vm.storage and vm.volumes is properly initialized always. One of the thing is serializing them back to qubes.xml, which you've noticed. But there are many more places like this.

cfcs commented

@marmarek thank you for the pointer to entry points, that is super helpful!

Re: not raising when importing: That sounds reasonable to me, but if you only throw the exceptions at init time there is, as you say, multiple call sites, and some of them catch exceptions and have weird default behaviors (for instance somewhere it reverts to using the default storage pool if initialization fails -- which does not seem like an ideal way to handle the case where there is an existing VM on the missing pool).

(EDIT: To be clear I agree that defining error semantics for pool/volume initialization and raising the exceptions at those points is the clean way to go, and that there should not be exceptions throw on import)