_gdbm.error: [Errno 11] Resource temporarily unavailable: '/var/lib/refind-btrfs/local_db'
Closed this issue · 11 comments
Currently running into the aforementioned issue when refind-btrfs.service
gets executed. Started happening after I moved /var
to its own @var
subvolume (copied all children with cp --archive -r --reflink=always
). Deleting /var/lib/refind-btrfs/local_db
and restarting refind-btrfs.service
works once or twice before failing again, tested by creating a snapshot with snapper -c root create --description "test"
and deleting it (a couple of times).
Package versions:
- refind-btrfs 0.6.2-1
- refind 0.14.2-1
- snapper 0.11.2-1
journalctl -u refind-btrfs.service
excerpt (full log: refind-btrfs.log)
-- Boot 8a3354baeed148c69957320a03564c58 --
Sep 25 15:24:10 archlinux systemd[1]: Starting Generate rEFInd manual boot stanzas from Btrfs snapshots...
Sep 25 15:24:12 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Scheduling watch for directories: /.snapshots, /.snapshots/1, /.snapshots/10, /.snapshots/11, /.snapshots/12, /.snapshots/13, /.snapshots/14, /.snapshots/2, /.snapshots/3, /.snapshots/4, /.snapshots/5, /.snapshots/6, /.snapshots/7, /.snapshots/8, /.snapshots/9.
Sep 25 15:24:12 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Starting the observer with PID 1112.
Sep 25 15:24:12 archlinux systemd[1]: Started Generate rEFInd manual boot stanzas from Btrfs snapshots.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: The '/.snapshots/15' snapshot has been created.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Initializing the block devices using lsblk.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Initializing the physical partition table for device '/dev/sda' using lsblk.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Initializing the live partition table for device '/dev/sda' using findmnt.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Initializing the physical partition table for device '/dev/sr0' using lsblk.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Initializing the live partition table for device '/dev/sr0' using findmnt.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Found the ESP mounted at '/boot' on '/dev/sda1'.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Found the root partition on '/dev/sda2'.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Found a separate boot partition on '/dev/sda1'.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Searching for snapshots of the '@' subvolume in the '/.snapshots' directory.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Found subvolume '@' mounted as the root partition.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Found 15 snapshots of the '@' subvolume.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Searching for the 'refind.conf' file on '/dev/sda1'.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: ERROR (refind_btrfs.service.snapshot_observer/snapshot_observer.py/run): An unexpected error happened, exiting...
Traceback (most recent call last):
File "/usr/lib/python3.12/site-packages/refind_btrfs/service/snapshot_observer.py", line 45, in run
self.dispatch_events(self.event_queue)
File "/usr/lib/python3.12/site-packages/watchdog/observers/api.py", line 381, in dispatch_events
handler.dispatch(event)
File "/usr/lib/python3.12/site-packages/watchdog/events.py", line 283, in dispatch
{
File "/usr/lib/python3.12/site-packages/refind_btrfs/service/snapshot_event_handler.py", line 92, in on_created
machine.run()
File "/usr/lib/python3.12/site-packages/refind_btrfs/state_management/refind_btrfs_machine.py", line 102, in run
while model.next_state():
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/transitions/core.py", line 406, in trigger
return self.machine._process(func)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/transitions/core.py", line 1236, in _process
return trigger()
^^^^^^^^^
File "/usr/lib/python3.12/site-packages/transitions/core.py", line 420, in _trigger
self._process(event_data)
File "/usr/lib/python3.12/site-packages/transitions/core.py", line 443, in _process
if trans.execute(event_data):
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/transitions/core.py", line 278, in execute
self._change_state(event_data)
File "/usr/lib/python3.12/site-packages/transitions/core.py", line 289, in _change_state
dest.enter(event_data)
File "/usr/lib/python3.12/site-packages/transitions/core.py", line 130, in enter
event_data.machine.callbacks(self.on_enter, event_data)
File "/usr/lib/python3.12/site-packages/transitions/core.py", line 1171, in callbacks
self.callback(func, event_data)
File "/usr/lib/python3.12/site-packages/transitions/core.py", line 1192, in callback
func(*event_data.args, **event_data.kwargs)
File "/usr/lib/python3.12/site-packages/refind_btrfs/state_management/model.py", line 173, in initialize_matched_boot_stanzas
refind_config = self.refind_config
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/refind_btrfs/state_management/model.py", line 380, in refind_config
return refind_config_provider.get_config(esp)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/refind_btrfs/boot/file_refind_config_provider.py", line 104, in get_config
return self._read_config_from(none_throws(config_file_path))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/refind_btrfs/boot/file_refind_config_provider.py", line 226, in _read_config_from
persisted_refind_config = persistence_provider.get_refind_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/refind_btrfs/utility/shelve_persistence_provider.py", line 75, in get_refind_config
with shelve.open(self._db_filename) as local_db:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/shelve.py", line 243, in open
return DbfilenameShelf(filename, flag, protocol, writeback)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/shelve.py", line 227, in __init__
Shelf.__init__(self, dbm.open(filename, flag), protocol, writeback)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/dbm/__init__.py", line 95, in open
return mod.open(file, flag, mode)
^^^^^^^^^^^^^^^^^^^^^^^^^^
_gdbm.error: [Errno 11] Resource temporarily unavailable: '/var/lib/refind-btrfs/local_db'
Sep 25 15:24:35 archlinux systemd[1]: refind-btrfs.service: Main process exited, code=exited, status=1/FAILURE
Sep 25 15:24:35 archlinux systemd[1]: refind-btrfs.service: Failed with result 'exit-code'.
lsblk
output:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 931.5G 0 disk
├─sda1 8:1 0 1G 0 part /boot
└─sda2 8:2 0 930.5G 0 part /.snapshots
/var
/home
/swap
/
sr0 11:0 1 1024M 0 rom
Other relevant files:
btrfs su l /
output: subvolume-list.txtESP/EFI/BOOT/refind.conf
: refind.txtESP/refind_linux.conf
: refind_linux.txt (probably not relevant, I disabled internal scan in favor of only having manual stanzas, but including for completeness)
Currently running a systemd path unit workaround:
/etc/systemd/system/refind-btrfs-workaround.path
[Unit]
Description = Monitoring /.snapshots for new Snapper entries
[Path]
PathModified = /.snapshots
Unit = refind-btrfs-workaround.service
[Install]
WantedBy = multi-user.target
/etc/systemd/system/refind-btrfs-workaround.service
[Unit]
Description = Populate rEFInd snapshot stanza
[Service]
Type = oneshot
ExecStart = /usr/bin/refind-btrfs
[Install]
WantedBy = multi-user.target
If you need any other logs or alike, just let me know.
So, it seems that the problem has something to do with one process reading the /var/lib/refind-btrfs/local_db
and another trying to write to it, at the same time. Why? I have no idea because I've never tried isolating the /var
directory into a subvolume of its own and I don't really see why that would somehow cause this issue.
Try disabling both services ("vanilla" and your own workaround service) and run refind-btrfs
"manually". Probably a stupid question but did you add a /var
entry in your fstab
file? I suppose you did because lsblk
prints info about it.
EDIT: As far as I can tell, your workaround service will run refind-btrfs
only once and then exit which isn't what's it supposed to do.
EDIT2: I'm opening and closing the local DB (really just a cache) multiple times, once per every method call. That should probably be changed but still, I have no explanation why it suddenly stopped working once you've introduced a separate subvolume for /var
.
Disabling (and stopping) all relevant services and running refind-btrfs
manually seems to work, tested 5 times by creating and deleting the same snapshot (n. 16).
Creation:
$ sudo refind-btrfs
Initializing the block devices using lsblk.
Initializing the physical partition table for device '/dev/loop0' using lsblk.
Initializing the live partition table for device '/dev/loop0' using findmnt.
Initializing the physical partition table for device '/dev/sda' using lsblk.
Initializing the live partition table for device '/dev/sda' using findmnt.
Initializing the physical partition table for device '/dev/sr0' using lsblk.
Initializing the live partition table for device '/dev/sr0' using findmnt.
Found the ESP mounted at '/boot' on '/dev/sda1'.
Found the root partition on '/dev/sda2'.
Found a separate boot partition on '/dev/sda1'.
Searching for snapshots of the '@' subvolume in the '/.snapshots' directory.
Found subvolume '@' mounted as the root partition.
Found 14 snapshots of the '@' subvolume.
Searching for the 'refind.conf' file on '/dev/sda1'.
Found 1 boot stanza matched with the root partition.
Initializing the static partition table for subvolume '@snapshots/16/snapshot' from its fstab file.
Found 1 snapshot for addition.
Found 1 snapshot for removal.
Creating a new writable snapshot from the read-only '@snapshots/16/snapshot' snapshot at '/root/.refind-btrfs/rwsnap_2024-09-25_16-58-19_ID1231'.
Modifying the '/root/.refind-btrfs/rwsnap_2024-09-25_16-58-19_ID1231/etc/fstab' file.
Deleting the '@/root/.refind-btrfs/rwsnap_2024-09-25_15-11-08_ID1194' snapshot.
Writing to the 'btrfs-snapshot-stanzas/83417376-83a8-4856-bfa6-6427caf19824_vmlinuz-linux.conf' file.
Deletion:
❱ sudo refind-btrfs
Initializing the block devices using lsblk.
Initializing the physical partition table for device '/dev/loop0' using lsblk.
Initializing the live partition table for device '/dev/loop0' using findmnt.
Initializing the physical partition table for device '/dev/sda' using lsblk.
Initializing the live partition table for device '/dev/sda' using findmnt.
Initializing the physical partition table for device '/dev/sr0' using lsblk.
Initializing the live partition table for device '/dev/sr0' using findmnt.
Found the ESP mounted at '/boot' on '/dev/sda1'.
Found the root partition on '/dev/sda2'.
Found a separate boot partition on '/dev/sda1'.
Searching for snapshots of the '@' subvolume in the '/.snapshots' directory.
Found subvolume '@' mounted as the root partition.
Found 13 snapshots of the '@' subvolume.
Searching for the 'refind.conf' file on '/dev/sda1'.
Found 1 boot stanza matched with the root partition.
Initializing the static partition table for subvolume '@snapshots/9/snapshot' from its fstab file.
Found 1 snapshot for addition.
Found 1 snapshot for removal.
Creating a new writable snapshot from the read-only '@snapshots/9/snapshot' snapshot at '/root/.refind-btrfs/rwsnap_2024-09-25_15-11-08_ID1194'.
Modifying the '/root/.refind-btrfs/rwsnap_2024-09-25_15-11-08_ID1194/etc/fstab' file.
Deleting the '@/root/.refind-btrfs/rwsnap_2024-09-25_16-58-19_ID1231' snapshot.
Writing to the 'btrfs-snapshot-stanzas/83417376-83a8-4856-bfa6-6427caf19824_vmlinuz-linux.conf' file.
I did edit the fstab
but forgot to include it in the OP:
# /dev/sda2
UUID=2d8fc74f-870e-42aa-bcc9-6a4cf3a1f01d / btrfs rw,noatime,compress=zstd:3,space_cache=v2,subvolid=256,subvol=/@ 0 0
# /dev/sda2
UUID=2d8fc74f-870e-42aa-bcc9-6a4cf3a1f01d /home btrfs rw,noatime,compress=zstd:3,space_cache=v2,subvolid=257,subvol=/@home 0 0
# /dev/sda2
UUID=2d8fc74f-870e-42aa-bcc9-6a4cf3a1f01d /.snapshots btrfs rw,noatime,compress=zstd:3,space_cache=v2,subvolid=460,subvol=/@snapshots 0 0
# /dev/sda2
UUID=2d8fc74f-870e-42aa-bcc9-6a4cf3a1f01d /var btrfs rw,noatime,compress=zstd:3,space_cache=v2,subvolid=1175,subvol=/@var 0 0
# /dev/sda2
UUID=2d8fc74f-870e-42aa-bcc9-6a4cf3a1f01d /swap btrfs rw,noatime,compress=zstd:3,space_cache=v2,subvolid=1176,subvol=/@swap 0 0
# /dev/sda1
UUID=A150-9649 /boot vfat rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,utf8,errors=remount-ro 0 2
# [SWAP]
/swap/swapfile none swap defaults 0 0
Regarding workaround systemd units: the .service
unit is only meant to be executed when started by its .path
companion, which is the one actually enabled and active in the background.
The only theory I have is that somehow multiple refind-btrfs
instances or multiple threads of its directory watching mode of operation run concurrently and Python's Shelve module is not thread-safe. Perhaps multiple snapshot addition and/or deletion events happen (almost) at the same time?
I could try and add locks around Shelve method calls or find a thread-safe alternative. Anyhow, I don't have a quick fix for this problem and I'll leave this issue open. Thanks for the detailed information.
@Venom1991 Chiming in. From poking around, I saw the watchdog's events are executed multithreaded. Which makes me think that snapshot addition/deletion is also happening async.
I did some hack workarounds by slapping a Lock around the main calls in these locations:
https://github.com/Venom1991/refind-btrfs/blob/master/src/refind_btrfs/service/snapshot_event_handler.py#L92
https://github.com/Venom1991/refind-btrfs/blob/master/src/refind_btrfs/service/snapshot_event_handler.py#L110
I gave these "hack" fixes to Kartofell and it seems to have fixed the issue for him.
Been 4 days of running nonstop without any issues. I be happy to shoot a PR if you think this is a fair way to fix this issue.
Sure, open a PR and I'll have a look at it. If it works, it works I suppose. The root cause might have something with this PEP. Perhaps the Watchdog maintainers chose to "turn off" the GIL for their library? I can't be sure but my code (the event handler which you've linked) didn't change in years.
@Venom1991 Sorry, I dont see any of the comments. Did you hit end review?