Unclear what to do after reverting state
josephtate opened this issue · 1 comment
Describe the bug
Perhaps this is a documentation issue, but it's unclear what an admin needs to do after booting an old snapshot from the GRUB menu to keep their system working smoothly.
I tried to install a real-time kernel to do some Ubuntu Studio work, but that kernel was unable to load my zfs pools, so I reverted. Now I have two sets of zfs snapshots, and worse still, several zfs and zsys services don't work, zsys boot-prepare segfaults, and I don't have confidence in the system anymore.
To Reproduce
Steps to reproduce the behavior:
- Install Ubuntu + ZFS root
- Download the ubuntu studio installer and install the real time kernel
- Reboot, see zfs failure
- Reboot and rollback via the grub menu to the previous snapshot
- What do I do next?
Expected behavior
I was expecting some sort of zsys command to make the revert permanent: something that would promote the current clone to be the canonical system state (for example by running zfs promote) and delete the other branch.
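For illustration, I imagine it would boil down to something like this (the second dataset name is hypothetical; I never confirmed exactly what zsys would do):
$ zfs promote rpool/ROOT/ubuntu_r20rzf        # make the reverted clone the primary dataset
$ zfs destroy -R rpool/ROOT/ubuntu_XXXXXX     # hypothetical name for the now-orphaned original branch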
For ubuntu users, please run and copy the following:
ubuntu-bug zsys --save=/tmp/report
- Copy paste below /tmp/report content:
I was unable to generate the report as directed:
$ sudo ubuntu-bug zsys --save=/tmp/report
*** Collecting problem information
The collected information can be sent to the developers to improve the
application. This might take a few minutes.
.......
*** Problem in zsys
The problem cannot be reported:
This is not an official KDE package. Please remove any third party package and try again.
Press any key to continue...
No pending crash reports. Try --help for more information.
Screenshots
If applicable, add screenshots to help explain your problem.
Installed versions:
- OS:
$ cat /etc/os-release
NAME="KDE neon Plasma LTS"
VERSION="5.18"
ID=neon
ID_LIKE="ubuntu debian"
PRETTY_NAME="KDE neon Plasma LTS Edition 5.18"
VARIANT="Plasma LTS Edition"
VERSION_ID="20.04"
HOME_URL="https://neon.kde.org/"
SUPPORT_URL="https://neon.kde.org/"
BUG_REPORT_URL="https://bugs.kde.org/"
LOGO=start-here-kde-neon
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
- Zsysd running version: zsysctl 0.4.8
Additional context
Add any other context about the problem here.
Mount shows the following datasets loaded:
rpool/ROOT/ubuntu_r20rzf on / type zfs (rw,relatime,xattr,posixacl)
rpool/USERDATA/username_aqwu6c on /home/jtate type zfs (rw,relatime,xattr,posixacl)
rpool/USERDATA/root_03fo29tr on /root type zfs (rw,relatime,xattr,posixacl)
bpool/BOOT/ubuntu_r20rzf on /boot type zfs (rw,nodev,relatime,xattr,posixacl)
rpool/ROOT/ubuntu_r20rzf/var/games on /var/games type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu_r20rzf/var/www on /var/www type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu_r20rzf/var/log on /var/log type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu_r20rzf/var/lib on /var/lib type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu_r20rzf/usr/local on /usr/local type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu_r20rzf/var/snap on /var/snap type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu_r20rzf/var/spool on /var/spool type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu_r20rzf/srv on /srv type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu_r20rzf/var/mail on /var/mail type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu_r20rzf/var/lib/dpkg on /var/lib/dpkg type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu_r20rzf/var/lib/NetworkManager on /var/lib/NetworkManager type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu_r20rzf/var/lib/AccountsService on /var/lib/AccountsService type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu_r20rzf/var/lib/apt on /var/lib/apt type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu_r20rzf/var/lib/0b3174a11e50edb014a03ca2efa4fddfa481f781a2ff233c785668d42c3dac72 on /var/lib/docker/zfs/graph/0b3174a11e50edb014a03ca2efa4fddfa481f781a2ff233c785668d42c3dac72 type zfs (rw,relatime,xattr,posixacl)
But I had to mount most of those manually with zfs mount.
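For reference, the manual mounts were just per-dataset zfs mount calls like this (the dataset is one from the list above):
$ sudo zfs mount rpool/ROOT/ubuntu_r20rzf/var/log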
$ systemctl status zsys*
● zsys-gc.timer - Clean up old snapshots to free space
Loaded: loaded (/lib/systemd/system/zsys-gc.timer; enabled; vendor preset: enabled)
Active: active (waiting) since Thu 2021-01-07 00:07:36 EST; 4 days ago
Trigger: Tue 2021-01-12 23:09:46 EST; 23h left
Triggers: ● zsys-gc.service
Jan 07 00:07:36 denali.int.dragonstrider.com systemd[1]: Started Clean up old snapshots to free space.
● zsysd.socket - Socker activation for zsys daemon
Loaded: loaded (/lib/systemd/system/zsysd.socket; enabled; vendor preset: enabled)
Active: failed (Result: service-start-limit-hit) since Thu 2021-01-07 00:10:22 EST; 4 days ago
Triggers: ● zsysd.service
Listen: /run/zsysd.sock (Stream)
Jan 07 00:07:36 denali.int.dragonstrider.com systemd[1]: Listening on Socker activation for zsys daemon.
Jan 07 00:10:22 denali.int.dragonstrider.com systemd[1]: zsysd.socket: Failed with result 'service-start-limit-hit'.
● zsysd.service - ZSYS daemon service
Loaded: loaded (/lib/systemd/system/zsysd.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2021-01-07 00:10:22 EST; 4 days ago
TriggeredBy: ● zsysd.socket
Main PID: 13566 (code=exited, status=2)
Jan 07 00:10:22 denali.int.dragonstrider.com zsysd[13566]: github.com/ubuntu/zsys/vendor/github.com/spf13/cobra.(*Command).Execute(...)
Jan 07 00:10:22 denali.int.dragonstrider.com zsysd[13566]: github.com/ubuntu/zsys/vendor/github.com/spf13/cobra/command.go:864
Jan 07 00:10:22 denali.int.dragonstrider.com zsysd[13566]: main.main()
Jan 07 00:10:22 denali.int.dragonstrider.com zsysd[13566]: github.com/ubuntu/zsys/cmd/zsysd/main.go:36 +0xdb
Jan 07 00:10:22 denali.int.dragonstrider.com systemd[1]: zsysd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 07 00:10:22 denali.int.dragonstrider.com systemd[1]: zsysd.service: Failed with result 'exit-code'.
Jan 07 00:10:22 denali.int.dragonstrider.com systemd[1]: Failed to start ZSYS daemon service.
Jan 07 00:10:22 denali.int.dragonstrider.com systemd[1]: zsysd.service: Start request repeated too quickly.
Jan 07 00:10:22 denali.int.dragonstrider.com systemd[1]: zsysd.service: Failed with result 'exit-code'.
Jan 07 00:10:22 denali.int.dragonstrider.com systemd[1]: Failed to start ZSYS daemon service.
● zsys-gc.service - Clean up old snapshots to free space
Loaded: loaded (/lib/systemd/system/zsys-gc.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2021-01-11 23:09:46 EST; 26min ago
TriggeredBy: ● zsys-gc.timer
Main PID: 2164204 (code=exited, status=1/FAILURE)
Jan 11 23:09:46 denali.int.dragonstrider.com systemd[1]: Starting Clean up old snapshots to free space...
Jan 11 23:09:46 denali.int.dragonstrider.com zsysctl[2164204]: level=error msg="couldn't connect to zsys daemon: connection error: desc = \"transport: Error while dialing dial unix /run/zsysd.sock: connect: connection refused\""
Jan 11 23:09:46 denali.int.dragonstrider.com systemd[1]: zsys-gc.service: Main process exited, code=exited, status=1/FAILURE
Jan 11 23:09:46 denali.int.dragonstrider.com systemd[1]: zsys-gc.service: Failed with result 'exit-code'.
Jan 11 23:09:46 denali.int.dragonstrider.com systemd[1]: Failed to start Clean up old snapshots to free space.
Well, I think I have fixed my system.
Not all these steps are necessary, but I thought starting from a clean slate would be faster than preserving zsys history or docker images (for example). Hopefully this can help someone.
- I stopped docker and removed /var/lib/docker and all its contents. Docker complicates the zfs layout, and I only run one thing anyway.
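Roughly, assuming the standard docker.service unit name, and only after confirming nothing needed lives under /var/lib/docker:
$ sudo systemctl stop docker
$ sudo rm -rf /var/lib/docker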
- I destroyed all docker-related zfs datasets and snapshots
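Something like this, using the one layer dataset visible in the mount output above (-R also removes its snapshots and any clones):
$ zfs list -r -o name rpool/ROOT/ubuntu_r20rzf/var/lib
$ sudo zfs destroy -R rpool/ROOT/ubuntu_r20rzf/var/lib/0b3174a11e50edb014a03ca2efa4fddfa481f781a2ff233c785668d42c3dac72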
- Then I figured out which of the two sets of datasets was in use (in my case, 03fo29tr was redundant and r20rzf was the working system). Look at the output of df or mount to see which one is mounted on /.
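For example (the mounted property shows directly which dataset is live):
$ df -h /
$ zfs list -o name,mountpoint,mounted -r rpool/ROOT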
- Start with bpool:
zfs promote bpool/BOOT/ubuntu_r20rzf
zfs destroy -R bpool/BOOT/ubuntu_03fo29tr
- Then I had to do this for every mounted filesystem:
zfs promote rpool/ROOT/ubuntu_r20rzf/<mountpoint>
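If there are a lot of them, a loop along these lines saves typing; this is just my own sketch, not a zsys command, and zfs promote simply reports an error on any dataset that isn't a clone:
$ for ds in $(zfs list -H -o name -r rpool/ROOT/ubuntu_r20rzf); do sudo zfs promote "$ds"; done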
- Now I could run zfs destroy on each of the redundant datasets. I didn't use -r or -R, so I had to destroy each one individually.
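For example (the redundant dataset id is a placeholder here; note that without -r, children have to be destroyed before their parents):
$ sudo zfs destroy rpool/ROOT/<redundant-id>/var/log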
- Finally, I ran zfs destroy on all the remaining autozsys snapshots that I could, leaving only one for the currently booted system.
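To find them (the snapshot name below only illustrates the autozsys naming pattern):
$ zfs list -t snapshot -o name | grep autozsys
$ sudo zfs destroy rpool/ROOT/ubuntu_r20rzf@autozsys_<id>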
- I ran
apt install --reinstall linux-image-5.4.0-65-generic
to make sure that I had a good initramfs. Apart from the usual noise about encryption not being set up (I didn't set up encryption), the output looked OK. There are still two warnings that I think I need to resolve, but I'll work on those.
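If you'd rather not reinstall the kernel package, regenerating the initramfs directly should work too (I didn't test this route here):
$ sudo update-initramfs -u -k 5.4.0-65-generic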
My rpool/USERDATA/root_ dataset still has the old id, but that doesn't seem to matter.
I rebooted, but the zfs-mount service was still failing to come up; zfs mount -a was giving me errors about / not being empty.
- I rebooted using an installer USB flash drive
- I ran
zpool import -R /system rpool
to import the rpool under an alternate root.
- Then I tried running zfs mount -a, clearing non-empty directories until it completed successfully (see the sketch after this list).
- Then I ran
zpool export rpool
to unmount everything and export the pool.
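To see what was blocking a given mountpoint, I just listed the directory under the alternate root and cleared it, along these lines (the path is an example, and only delete once you're sure nothing there is needed):
$ ls -A /system/var/log
$ sudo rm -rf /system/var/log/*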
When I rebooted again, it was still failing. I checked my canmount and mountpoint properties and found that I had two zfs datasets with / as the mountpoint, so I set one of them to "none".
- I still had some more mount problems, but I eventually deleted all the non-empty directories, so
zfs mount -a
completed without a core dump or error messages.
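The property check looked roughly like this; the dataset that gets mountpoint=none is whichever redundant one still claims /, so the name below is a placeholder:
$ zfs list -o name,mountpoint,canmount -r rpool/ROOT
$ sudo zfs set mountpoint=none rpool/ROOT/<redundant-id>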
BUT I still had problems in systemd: the zsys-commit service was not starting, but the workaround in #112 helped me get that running too.