Venom1991/refind-btrfs

_gdbm.error: [Errno 11] Resource temporarily unavailable: '/var/lib/refind-btrfs/local_db'

Closed this issue · 11 comments

Currently running into the aforementioned issue when refind-btrfs.service gets executed. Started happening after I moved /var to its own @var subvolume (copied all children with cp --archive -r --reflink=always). Deleting /var/lib/refind-btrfs/local_db and restarting refind-btrfs.service works once or twice before failing again, tested by creating a snapshot with snapper -c root create --description "test" and deleting it (a couple of times).

Package versions:

  • refind-btrfs 0.6.2-1
  • refind 0.14.2-1
  • snapper 0.11.2-1

journalctl -u refind-btrfs.service excerpt (full log: refind-btrfs.log)

-- Boot 8a3354baeed148c69957320a03564c58 --
Sep 25 15:24:10 archlinux systemd[1]: Starting Generate rEFInd manual boot stanzas from Btrfs snapshots...
Sep 25 15:24:12 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Scheduling watch for directories: /.snapshots, /.snapshots/1, /.snapshots/10, /.snapshots/11, /.snapshots/12, /.snapshots/13, /.snapshots/14, /.snapshots/2, /.snapshots/3, /.snapshots/4, /.snapshots/5, /.snapshots/6, /.snapshots/7, /.snapshots/8, /.snapshots/9.
Sep 25 15:24:12 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Starting the observer with PID 1112.
Sep 25 15:24:12 archlinux systemd[1]: Started Generate rEFInd manual boot stanzas from Btrfs snapshots.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: The '/.snapshots/15' snapshot has been created.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Initializing the block devices using lsblk.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Initializing the physical partition table for device '/dev/sda' using lsblk.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Initializing the live partition table for device '/dev/sda' using findmnt.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Initializing the physical partition table for device '/dev/sr0' using lsblk.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Initializing the live partition table for device '/dev/sr0' using findmnt.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Found the ESP mounted at '/boot' on '/dev/sda1'.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Found the root partition on '/dev/sda2'.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Found a separate boot partition on '/dev/sda1'.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Searching for snapshots of the '@' subvolume in the '/.snapshots' directory.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Found subvolume '@' mounted as the root partition.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Found 15 snapshots of the '@' subvolume.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: Searching for the 'refind.conf' file on '/dev/sda1'.
Sep 25 15:24:35 archlinux /usr/lib/python3.12/site-packages/refind_btrfs/__main__.py[1112]: ERROR (refind_btrfs.service.snapshot_observer/snapshot_observer.py/run): An unexpected error happened, exiting...
                                                                                            Traceback (most recent call last):
                                                                                              File "/usr/lib/python3.12/site-packages/refind_btrfs/service/snapshot_observer.py", line 45, in run
                                                                                                self.dispatch_events(self.event_queue)
                                                                                              File "/usr/lib/python3.12/site-packages/watchdog/observers/api.py", line 381, in dispatch_events
                                                                                                handler.dispatch(event)
                                                                                              File "/usr/lib/python3.12/site-packages/watchdog/events.py", line 283, in dispatch
                                                                                                {
                                                                                              File "/usr/lib/python3.12/site-packages/refind_btrfs/service/snapshot_event_handler.py", line 92, in on_created
                                                                                                machine.run()
                                                                                              File "/usr/lib/python3.12/site-packages/refind_btrfs/state_management/refind_btrfs_machine.py", line 102, in run
                                                                                                while model.next_state():
                                                                                                      ^^^^^^^^^^^^^^^^^^
                                                                                              File "/usr/lib/python3.12/site-packages/transitions/core.py", line 406, in trigger
                                                                                                return self.machine._process(func)
                                                                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                                                                              File "/usr/lib/python3.12/site-packages/transitions/core.py", line 1236, in _process
                                                                                                return trigger()
                                                                                                       ^^^^^^^^^
                                                                                              File "/usr/lib/python3.12/site-packages/transitions/core.py", line 420, in _trigger
                                                                                                self._process(event_data)
                                                                                              File "/usr/lib/python3.12/site-packages/transitions/core.py", line 443, in _process
                                                                                                if trans.execute(event_data):
                                                                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^
                                                                                              File "/usr/lib/python3.12/site-packages/transitions/core.py", line 278, in execute
                                                                                                self._change_state(event_data)
                                                                                              File "/usr/lib/python3.12/site-packages/transitions/core.py", line 289, in _change_state
                                                                                                dest.enter(event_data)
                                                                                              File "/usr/lib/python3.12/site-packages/transitions/core.py", line 130, in enter
                                                                                                event_data.machine.callbacks(self.on_enter, event_data)
                                                                                              File "/usr/lib/python3.12/site-packages/transitions/core.py", line 1171, in callbacks
                                                                                                self.callback(func, event_data)
                                                                                              File "/usr/lib/python3.12/site-packages/transitions/core.py", line 1192, in callback
                                                                                                func(*event_data.args, **event_data.kwargs)
                                                                                              File "/usr/lib/python3.12/site-packages/refind_btrfs/state_management/model.py", line 173, in initialize_matched_boot_stanzas
                                                                                                refind_config = self.refind_config
                                                                                                                ^^^^^^^^^^^^^^^^^^
                                                                                              File "/usr/lib/python3.12/site-packages/refind_btrfs/state_management/model.py", line 380, in refind_config
                                                                                                return refind_config_provider.get_config(esp)
                                                                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                                                                              File "/usr/lib/python3.12/site-packages/refind_btrfs/boot/file_refind_config_provider.py", line 104, in get_config
                                                                                                return self._read_config_from(none_throws(config_file_path))
                                                                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                                                                              File "/usr/lib/python3.12/site-packages/refind_btrfs/boot/file_refind_config_provider.py", line 226, in _read_config_from
                                                                                                persisted_refind_config = persistence_provider.get_refind_config(
                                                                                                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                                                                              File "/usr/lib/python3.12/site-packages/refind_btrfs/utility/shelve_persistence_provider.py", line 75, in get_refind_config
                                                                                                with shelve.open(self._db_filename) as local_db:
                                                                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                                                                              File "/usr/lib/python3.12/shelve.py", line 243, in open
                                                                                                return DbfilenameShelf(filename, flag, protocol, writeback)
                                                                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                                                                              File "/usr/lib/python3.12/shelve.py", line 227, in __init__
                                                                                                Shelf.__init__(self, dbm.open(filename, flag), protocol, writeback)
                                                                                                                     ^^^^^^^^^^^^^^^^^^^^^^^^
                                                                                              File "/usr/lib/python3.12/dbm/__init__.py", line 95, in open
                                                                                                return mod.open(file, flag, mode)
                                                                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^
                                                                                            _gdbm.error: [Errno 11] Resource temporarily unavailable: '/var/lib/refind-btrfs/local_db'
Sep 25 15:24:35 archlinux systemd[1]: refind-btrfs.service: Main process exited, code=exited, status=1/FAILURE
Sep 25 15:24:35 archlinux systemd[1]: refind-btrfs.service: Failed with result 'exit-code'.

lsblk output:

NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda      8:0    0 931.5G  0 disk 
├─sda1   8:1    0     1G  0 part /boot
└─sda2   8:2    0 930.5G  0 part /.snapshots
                                 /var
                                 /home
                                 /swap
                                 /
sr0     11:0    1  1024M  0 rom  

Other relevant files:

  • btrfs su l / output: subvolume-list.txt
  • ESP/EFI/BOOT/refind.conf: refind.txt
  • ESP/refind_linux.conf: refind_linux.txt (probably not relevant, I disabled internal scan in favor of only having manual stanzas, but including for completeness)

Currently running a systemd path unit workaround:
/etc/systemd/system/refind-btrfs-workaround.path

[Unit]
Description = Monitoring /.snapshots for new Snapper entries

[Path]
PathModified = /.snapshots
Unit = refind-btrfs-workaround.service

[Install]
WantedBy = multi-user.target

/etc/systemd/system/refind-btrfs-workaround.service

[Unit]
Description = Populate rEFInd snapshot stanza

[Service]
Type = oneshot
ExecStart = /usr/bin/refind-btrfs

[Install]
WantedBy = multi-user.target

If you need any other logs or alike, just let me know.

So, it seems that the problem has something to do with one process reading the /var/lib/refind-btrfs/local_db and another trying to write to it, at the same time. Why? I have no idea because I've never tried isolating the /var directory into a subvolume of its own and I don't really see why that would somehow cause this issue.

Try disabling both services ("vanilla" and your own workaround service) and run refind-btrfs "manually". Probably a stupid question but did you add a /var entry in your fstab file? I suppose you did because lsblk prints info about it.

EDIT: As far as I can tell, your workaround service will run refind-btrfs only once and then exit which isn't what's it supposed to do.

EDIT2: I'm opening and closing the local DB (really just a cache) multiple times, once per every method call. That should probably be changed but still, I have no explanation why it suddenly stopped working once you've introduced a separate subvolume for /var.

Disabling (and stopping) all relevant services and running refind-btrfs manually seems to work, tested 5 times by creating and deleting the same snapshot (n. 16).

Creation:

$ sudo refind-btrfs
Initializing the block devices using lsblk.
Initializing the physical partition table for device '/dev/loop0' using lsblk.
Initializing the live partition table for device '/dev/loop0' using findmnt.
Initializing the physical partition table for device '/dev/sda' using lsblk.
Initializing the live partition table for device '/dev/sda' using findmnt.
Initializing the physical partition table for device '/dev/sr0' using lsblk.
Initializing the live partition table for device '/dev/sr0' using findmnt.
Found the ESP mounted at '/boot' on '/dev/sda1'.
Found the root partition on '/dev/sda2'.
Found a separate boot partition on '/dev/sda1'.
Searching for snapshots of the '@' subvolume in the '/.snapshots' directory.
Found subvolume '@' mounted as the root partition.
Found 14 snapshots of the '@' subvolume.
Searching for the 'refind.conf' file on '/dev/sda1'.
Found 1 boot stanza matched with the root partition.
Initializing the static partition table for subvolume '@snapshots/16/snapshot' from its fstab file.
Found 1 snapshot for addition.
Found 1 snapshot for removal.
Creating a new writable snapshot from the read-only '@snapshots/16/snapshot' snapshot at '/root/.refind-btrfs/rwsnap_2024-09-25_16-58-19_ID1231'.
Modifying the '/root/.refind-btrfs/rwsnap_2024-09-25_16-58-19_ID1231/etc/fstab' file.
Deleting the '@/root/.refind-btrfs/rwsnap_2024-09-25_15-11-08_ID1194' snapshot.
Writing to the 'btrfs-snapshot-stanzas/83417376-83a8-4856-bfa6-6427caf19824_vmlinuz-linux.conf' file.

Deletion:

❱ sudo refind-btrfs
Initializing the block devices using lsblk.
Initializing the physical partition table for device '/dev/loop0' using lsblk.
Initializing the live partition table for device '/dev/loop0' using findmnt.
Initializing the physical partition table for device '/dev/sda' using lsblk.
Initializing the live partition table for device '/dev/sda' using findmnt.
Initializing the physical partition table for device '/dev/sr0' using lsblk.
Initializing the live partition table for device '/dev/sr0' using findmnt.
Found the ESP mounted at '/boot' on '/dev/sda1'.
Found the root partition on '/dev/sda2'.
Found a separate boot partition on '/dev/sda1'.
Searching for snapshots of the '@' subvolume in the '/.snapshots' directory.
Found subvolume '@' mounted as the root partition.
Found 13 snapshots of the '@' subvolume.
Searching for the 'refind.conf' file on '/dev/sda1'.
Found 1 boot stanza matched with the root partition.
Initializing the static partition table for subvolume '@snapshots/9/snapshot' from its fstab file.
Found 1 snapshot for addition.
Found 1 snapshot for removal.
Creating a new writable snapshot from the read-only '@snapshots/9/snapshot' snapshot at '/root/.refind-btrfs/rwsnap_2024-09-25_15-11-08_ID1194'.
Modifying the '/root/.refind-btrfs/rwsnap_2024-09-25_15-11-08_ID1194/etc/fstab' file.
Deleting the '@/root/.refind-btrfs/rwsnap_2024-09-25_16-58-19_ID1231' snapshot.
Writing to the 'btrfs-snapshot-stanzas/83417376-83a8-4856-bfa6-6427caf19824_vmlinuz-linux.conf' file.

I did edit the fstab but forgot to include it in the OP:

# /dev/sda2
UUID=2d8fc74f-870e-42aa-bcc9-6a4cf3a1f01d	/         	btrfs     	rw,noatime,compress=zstd:3,space_cache=v2,subvolid=256,subvol=/@	0 0

# /dev/sda2
UUID=2d8fc74f-870e-42aa-bcc9-6a4cf3a1f01d	/home     	btrfs     	rw,noatime,compress=zstd:3,space_cache=v2,subvolid=257,subvol=/@home	0 0

# /dev/sda2
UUID=2d8fc74f-870e-42aa-bcc9-6a4cf3a1f01d	/.snapshots	btrfs     	rw,noatime,compress=zstd:3,space_cache=v2,subvolid=460,subvol=/@snapshots	0 0

# /dev/sda2
UUID=2d8fc74f-870e-42aa-bcc9-6a4cf3a1f01d	/var      	btrfs     	rw,noatime,compress=zstd:3,space_cache=v2,subvolid=1175,subvol=/@var	0 0

# /dev/sda2
UUID=2d8fc74f-870e-42aa-bcc9-6a4cf3a1f01d	/swap     	btrfs     	rw,noatime,compress=zstd:3,space_cache=v2,subvolid=1176,subvol=/@swap	0 0

# /dev/sda1
UUID=A150-9649      	/boot     	vfat      	rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,utf8,errors=remount-ro	0 2

# [SWAP]
/swap/swapfile  none    swap    defaults    0   0

Regarding workaround systemd units: the .service unit is only meant to be executed when started by its .path companion, which is the one actually enabled and active in the background.

The only theory I have is that somehow multiple refind-btrfs instances or multiple threads of its directory watching mode of operation run concurrently and Python's Shelve module is not thread-safe. Perhaps multiple snapshot addition and/or deletion events happen (almost) at the same time?

I could try and add locks around Shelve method calls or find a thread-safe alternative. Anyhow, I don't have a quick fix for this problem and I'll leave this issue open. Thanks for the detailed information.

@Venom1991 Chiming in. From poking around, I saw the watchdog's events are executed multithreaded. Which makes me think that snapshot addition/deletion is also happening async.

I did some hack workarounds by slapping a Lock around the main calls in these locations:
https://github.com/Venom1991/refind-btrfs/blob/master/src/refind_btrfs/service/snapshot_event_handler.py#L92
https://github.com/Venom1991/refind-btrfs/blob/master/src/refind_btrfs/service/snapshot_event_handler.py#L110

I gave these "hack" fixes to Kartofell and it seems to have fixed the issue for him.
Been 4 days of running nonstop without any issues. I be happy to shoot a PR if you think this is a fair way to fix this issue.

@Crec0

Sure, open a PR and I'll have a look at it. If it works, it works I suppose. The root cause might have something with this PEP. Perhaps the Watchdog maintainers chose to "turn off" the GIL for their library? I can't be sure but my code (the event handler which you've linked) didn't change in years.

@Crec0

Hi, I've added a couple of comments to your PR.

@Venom1991 Sorry, I dont see any of the comments. Did you hit end review?

@Crec0

I did now. 😂 It's a bit more clicking than on Bitbucket or GitLab.

@Crec0

Merged the PR, I'll try releasing it tomorrow.

@Crec0

Merged the PR, I'll try releasing it tomorrow.

Thanks! No rush at all!

Fixed in release 0.6.3.