Antynea/grub-btrfs

Unable to boot from snapshot on Fedora 35 after grub update

crankedguy opened this issue · 25 comments

I decided to open a new issue and reference it from here. This is also better in terms of documentation, as the other one is already closed. Please see the further comments as well.

Hi,

I will write onto this thread here because there is definitely "something" going on. The OP wrote a couple of weeks ago when the config file changed.
For me it was yesterday here on F35 and it is definitely exactly the same behaviour, and absolutely non-resolvable on a production system. You can do whatever you want, reinstall all grub and shim, regenerate grub.cfg etc., it always leads to Unknown TPM error. I did not find a single solution within hours for this and so have to scratch the whole system, as also new snapshots will not work anymore. It is just dead.

Here it was caused by a recent grub2 update, which actually should have nothing to do with it at all.
grub2-common is just an example, all grub- and shim- packages were updated

Changelogs for grub2-common-1:2.06-10.fc35.noarch
* Fri Dec 10 00:00:00 2021 Robbie Harwood <rharwood@redhat.com> - 2.06-10
- Bump to re-do signing (no code changes)

* Thu Dec 09 00:00:00 2021 Robbie Harwood <rharwood@redhat.com> - 2.06-9
- Restore grub.cfg umask (CVE-2021-3981)
- Resolves: rhbz#2030358

snapshot handling is gone with it. And I don't have the slightest idea....

@Antynea: As this seems to be related not to the most recent change but to updates of the grub- or shim- packages in general, do you have any clue?

Originally posted by @crankedguy in #116 (comment)

hello,

I just tested again on Fedora 35:

# rpm -qa | grep grub2
grub2-common-2.06-10.fc35.noarch
grub2-tools-minimal-2.06-10.fc35.x86_64
grub2-tools-2.06-10.fc35.x86_64
grub2-pc-modules-2.06-10.fc35.noarch
grub2-pc-2.06-10.fc35.x86_64
grub2-efi-ia32-2.06-10.fc35.x86_64
grub2-efi-x64-2.06-10.fc35.x86_64
grub2-tools-extra-2.06-10.fc35.x86_64
grub2-efi-ia32-cdboot-2.06-10.fc35.x86_64
grub2-efi-x64-cdboot-2.06-10.fc35.x86_64
grub2-tools-efi-2.06-10.fc35.x86_64

latest version of grub-btrfs (4.11)

Details of the installation:
Note: I don't have any TPM module.

Hdd partitions:
Type mbr
/dev/sda1 /boot ext4
/dev/sda2 / btrfs

3 subvolumes:
/root
/home
@snapshots

The snapshots menu appears in grub, and booting from it works.

Perhaps you have a different configuration, could you share it?
Please specify in particular whether you are using UEFI and whether TPM is enabled.

Thanks for your reply! Did you also read the further comments in the closed issue? #116 (comment)

You have to put it in context to the other issue, and also what I wrote in the bugzilla issue. It works without any issues with the 2.06-10 version, that is if you reinstall the whole system.
What does not work though is, there was 2.06-8 installed on the system and it was updated to 2.06-10. After the update everything broke.
It has something to do with the updates of the grub packages on Fedora. They seem to change something around the EFI partition and the EFI/TPM connection. Honestly, it is completely unclear to me what this could be, but it is very clear from the other issue that the OP had a similar situation (he said the config file changed).
Fact is after the update you just have the entries left but cannot do anything anymore with them.

So, 2.06-8, everything works fine, 2.06-10, everything works fine.

Update from 2.06-8 to 2.06-10 and all snapshots are inaccessible to boot. And this seems to be a general thing when grub-packages are updated on Fedora.
Why are the kernels not found anymore all of a sudden? Something happens there.

UEFI and TPM enabled. TPM you cannot switch off completely on AMD TR3960X, at least not on my MB.

Config :

/dev/nvme0n1p1 /boot/efi FAT32
/dev/nvme0n1p2 /boot BTRFS
/dev/nvme0n1p3 / BTRFS
/dev/nvme0n1p4 /home BTRFS

subvolumes : 
on /dev/nvme0n1p2
@boot 
@snapshots

on /dev/nvme0n1p3
@
@var
@usrlocal
@srv
@opt
@root
@snapshots
@swap

on /dev/nvme0n1p4
@home
@snapshots

bind mount of /usr/var/lib to /var/lib which provides full snapshots of /var/lib when a @ snapshot is done,
and therefore full functionality when booting into snapshots. 
/usr/var/lib is created with contents of /var/lib during install and then immediately bind-mounted. @var/lib contents 
are deleted afterwards

Except the /boot and /boot/efi all is encrypted with LUKS2
I think you should forget the rather extensive configuration in terms of subvolumes, it is not that there is an issue with the general function/workflow as said. Everything works totally fine and as it should. One of the best configs I ever did for my purpose.
Booting into ro snapshots, grub menu everything...

That is, until you update grub, then it is done... Same error behavior as the OP had in the closed issue, and I am sure he did not have a "messed up" system in any way since yesterday.

Thank you for the clarification, I didn't quite understand, sorry.
Have you tried unloading the tpm module by editing the snapshot menu entry?

    menuentry '  vmlinuz-5.15.6-200.fc35.x86_64 & initramfs-5.15.6-200.fc35.x86_64.img' --class snapshots --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-snapshots-df455913-4219-4083-85ea-5f91c34848c2' {
        if [ x$feature_all_video_module = xy ]; then
        insmod all_video
        rmmod tpm # <------ add this line here
        fi
        set gfxpayload=keep
        insmod ext2
        if [ x$feature_platform_search_hint = xy ]; then
            search --no-floppy --fs-uuid  --set=root --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1 --hint='hd0,msdos1'  df455913-4219-4083-85ea-5f91c34848c2
        else
            search --no-floppy --fs-uuid  --set=root df455913-4219-4083-85ea-5f91c34848c2
        fi
        echo 'Loading Snapshot: 2021-12-04 11:18:08 root/.snapshots/1/snapshot'
        echo 'Loading Kernel: vmlinuz-5.15.6-200.fc35.x86_64 ...'
        linux "/vmlinuz-5.15.6-200.fc35.x86_64" root=UUID=23eda1b9-56b7-4387-b9cb-1b3e2178aa8f rhgb quiet  rootflags=compress=zstd:1,subvol="root/.snapshots/1/snapshot"
        echo 'Loading Initramfs: initramfs-5.15.6-200.fc35.x86_64.img ...'
        initrd "/initramfs-5.15.6-200.fc35.x86_64.img"
    }

If it works, it could be a workaround ...

No I did not try that one. The big problem is I have no access to this machine config anymore as this is a prod workstation and I had to reinstall it yesterday so I could work, which was a real hassle (I have snapshots for that normally lol)

I will boot into the fresh install later and have to check if the tpm module in the kernel is even activated now.
Until this is clarified I will have to freeze the grub- packages from updates so this does not happen again.

I might have to make another bootable installation for further tests then as I don't see how you could simulate that in a VM in any reliable way. I will post if I have some more information on that matter
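Freezing the grub packages, as mentioned above, could be sketched roughly like this. It is only a sketch: the plugin package name `python3-dnf-plugin-versionlock` and the `run` helper are assumptions on my side, and `DRY_RUN` defaults to printing the commands instead of executing them.

```shell
# Sketch: pin the grub2 packages so dnf skips them during updates.
# Assumption: the versionlock plugin ships as python3-dnf-plugin-versionlock.
# "run" is a hypothetical helper; with DRY_RUN=1 (the default) it only prints.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

lock_grub_packages() {
    run sudo dnf install -y python3-dnf-plugin-versionlock
    run sudo dnf versionlock add 'grub2*'
}

unlock_grub_packages() {
    run sudo dnf versionlock delete 'grub2*'
}

# Print what would be executed.
lock_grub_packages
```

Setting `DRY_RUN=0` before calling `lock_grub_packages` would actually run the commands; `unlock_grub_packages` removes the lock again once the issue is understood.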

So this will not work as a workaround, because no tpm module is loaded in the kernel. I had already checked that yesterday but forgot about it; it was late.

[machine ~]# lsmod | grep -i tpm
[machine ~]# 

[machine ~]# dmesg | grep -i tpm
[    0.000000] efi: ACPI=0xa9aee000 ACPI 2.0=0xa9aee014 TPMFinalLog=0xa9c12000 SMBIOS=0xab030000 SMBIOS 3.0=0xab02f000 MEMATTR=0xa1a4c018 ESRT=0xa7032998 MOKvar=0xa136d000 RNG=0xab039b18 TPMEventLog=0x9d412018 
[    0.000000] ACPI: TPM2 0x00000000A9AC3000 00004C (v04 AMD    A M I    00000001 AMI  00000000)
[    0.000000] ACPI: Reserving TPM2 table memory at [mem 0xa9ac3000-0xa9ac304b]

the initramfs has these modules in it:
[machine ~]# lsinitrd /boot/initramfs-5.15.6-200.fc35.x86_64.img | grep -i tpm
-rw-r--r--   1 root     root         6832 Oct 28 21:55 usr/lib/modules/5.15.6-200.fc35.x86_64/kernel/crypto/asymmetric_keys/asym_tpm.ko.xz
-rw-r--r--   1 root     root         2116 Oct 28 21:55 usr/lib/modules/5.15.6-200.fc35.x86_64/kernel/crypto/asymmetric_keys/tpm_key_parser.ko.xz
[machine ~]# 

The TPM is activated in the CPU, yes as you can see in the boot. But nothing is done with it. I do not unlock anything with it.

So, this makes the error even more strange. Honestly I don't really know where to go from here. What I could do is create another bare-metal install with the 2.06-6 packages (2.06-8 packages are not installable via dnf), put a dnf versionlock on them, then finish the install, take some snapshots, try to boot into them, and then update the grub-* packages, boot into the active version (which is still possible) and compare the whole /boot with the pre snapshot. But I don't know what I would get out of this. Of course the files will be different, and I do not think I will see anything in the config files; these will be binary changes.

Ah, tpm module is builtin, so you cannot remove it with rmmod, nor modprobe it anyways. So some of my assumptions were wrong, it is there and it is not to be removed in any way as a workaround...

Grub is modular.
On Fedora, Grub is compiled to load its own tpm module (not to be confused with the kernel's TPM module).
The "rmmod tpm" directive tells Grub not to load its module for this entry.

Ah I misunderstood. Ok will try it when the opportunity comes up. Because right now I have no luck in replicating. From an existing install I tried to downgrade grub2-* and then later update it, reconfiguring with grub2-mkconfig, nothing leads to this error.
The only thing left is a fresh install with a lower grub2-* version as said before and then upgrading it. This would be essentially what happened on the prod machine, which was relatively fresh otherwise, but not completely fresh. If this also does not lead to the error it will be a game keeping Fedora in this constellation. I don't like games.

LOL I don't know why I still work professionally in IT, giving me all this from day to day.
I did the following

I had 2.06-10 installed
downgrade to 2.06-6 reboot ... ok
upgrade to 2.06-10 reboot ... ok
regenerate grub.cfg manually reboot ... ok
change /etc/default/grub, regenerate reboot ... ok
change /etc/default/grub differently, regenerate reboot ... ok
reinstall shim* reboot ... ok
reinstall grub2-common reboot ... ok
downgrade grub2* to 2.06-6 again, versionlock grub2* reboot ... UNKNOWN TPM ERROR, unfixable, out of the blue...
but main version bootable
login check everything in /boot and /boot/efi against previously saved, everything ok 
except what I had changed on configs
reinstall shim* grub2* reboot ... UNKNOWN TPM ERROR, but main version bootable...
rmmod tpm reboot ... UNKNOWN TPM ERROR

What did the two things have in common then? It happened on the prod machine on the update, it happened here on a downgrade, everything looks fine...
Does it have nothing to do with the grub version? Why is the main version still bootable without the Unknown TPM error, but the snapshots are not?

Yea well right, the only thing the two had in common was : Really a lot of menu entries built up in the snapshot submenu
Ok, I delete half of them manually...
Unknown TPM error gone...for good
This is one of the worst bugs I have ever seen in my life...
If a submenu gets too big grub reacts with Unknown TPM error...
This makes my whole life fulfilled now...

I saw that you implemented a warning when 250 menu entries are exceeded. I can tell you that this whole thing happens even before reaching 40, at least on Fedora and in my infrastructure. I did not count how many exactly and I did not test on other systems, I just removed a bunch of them and everything worked again.
Another hint that this is a direct grub issue is that you don't even come to any code where you can try something like rmmod or anything.
Because the error already occurs when you try to enter the submenu of the snapshot itself, that is, before you even reach the initramfs entries in the menus that directly initiate the boot process once selected.
This might have to do with Fedora's very own BLSCFG story and the changes they implemented for that on top of vanilla...
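To compare systems, it may help to count how many entries the generated snapshot menu actually contains. A minimal sketch, assuming the menu file is a plain GRUB config with one `menuentry` line per bootable entry (the function name and the example path are assumptions, not part of grub-btrfs):

```shell
# Count the menuentry lines in a generated GRUB menu file.
# Only lines that start with "menuentry" (after optional indentation)
# are counted; "submenu" lines are ignored.
count_snapshot_entries() {
    grep -cE '^[[:space:]]*menuentry[[:space:]]' "$1"
}

# Example call (the path is an assumption; adjust it to your system):
# count_snapshot_entries /boot/grub2/grub-btrfs.cfg
```

Note that one snapshot can produce several `menuentry` lines (one per kernel/initramfs combination), so this number can be larger than the number of snapshots.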

Hello,

I installed Fedora 35 with uefi and tpm 2.0.
I easily reproduced the bug.

Installed from Fedora-Workstation-Live-x86_64-35-1.2.iso
I only installed grub-btrfs, and this command grub2-editenv - unset menu_auto_hide to display the Grub menu at boot.
I didn't have to juggle the different Grub packages like you.
On a fresh install the bug is present.

# rpm -qa | grep grub2
grub2-tools-minimal-2.06-6.fc35.x86_64
grub2-common-2.06-6.fc35.noarch
grub2-tools-2.06-6.fc35.x86_64
grub2-efi-ia32-2.06-6.fc35.x86_64
grub2-efi-x64-2.06-6.fc35.x86_64
grub2-pc-modules-2.06-6.fc35.noarch
grub2-pc-2.06-6.fc35.x86_64
grub2-tools-extra-2.06-6.fc35.x86_64
grub2-tools-efi-2.06-6.fc35.x86_64
grub2-efi-ia32-cdboot-2.06-6.fc35.x86_64
grub2-efi-x64-cdboot-2.06-6.fc35.x86_64

# uname -a
Linux fedora 5.14.10-300.fc35.x86_64 #1 SMP Thu Oct 7 20:48:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

My idea to disable tpm works.
But as you explained, the bug is present as soon as the grub-btrfs file is loaded, not after, which changes everything.

So you need to add rmmod tpm when loading the grub-btrfs file.

rmmod.tpm.mp4

This could be a solution for those who are willing to unload the tpm module. For the others, I have no idea at the moment.
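For illustration, here is a rough sketch of where the `rmmod tpm` line would go. It assumes that the generated grub.cfg loads the snapshot menu through a `configfile` line that references grub-btrfs.cfg; the function name is hypothetical. The function only prints a patched copy and does not modify the original file.

```shell
# Print a copy of a grub.cfg with "rmmod tpm" inserted just before the
# line that loads grub-btrfs.cfg. The input file is left untouched.
# Assumption: the snapshot submenu uses a "configfile ... grub-btrfs.cfg" line.
with_rmmod_tpm() {
    awk '
        /configfile/ && /grub-btrfs\.cfg/ { print "    rmmod tpm" }
        { print }
    ' "$1"
}

# Example call (the path is an assumption; adjust it to your system):
# with_rmmod_tpm /boot/grub2/grub.cfg | less
```

Since grub.cfg is regenerated by grub2-mkconfig, a lasting change would presumably have to be made in the script under /etc/grub.d that emits the snapshot submenu, not in grub.cfg itself.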

Hi,
thanks for your reply. Yes, of course you do not have to juggle around. I just juggled around because the thought was that it has to do with the package updates. Which is clearly not the case from what I found out through that.
This might be a very longstanding bug already.

I see 2 cases for now :
a) You disable TPM like you proposed
b) You stay under 20 snapshots that are displayed in the menu. It seems to work well below around 20. I did not test one after the other until it died. But every test I did below about 20 worked actually.
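For case b), grub-btrfs has a limit setting that caps how many snapshots end up in the menu. A small fragment, assuming the config file lives at /etc/default/grub-btrfs/config; the value 15 is only illustrative, based on the "below about 20" observation and not on a known safe threshold:

```shell
# Fragment for the grub-btrfs config file
# (assumed path: /etc/default/grub-btrfs/config).
# Illustrative value only; the real threshold was not determined.
GRUB_BTRFS_LIMIT="15"
```

After changing it, the menu has to be regenerated (on Fedora with grub2-mkconfig) before the new limit shows up.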

The question is where to address it in the long term. I clearly see it as a grub issue.

Did you try it on other distros with TPM enabled? Is there the same error? If yes, it would be to be reported to grub upstream, otherwise it should be reported to Fedora, as it then would have to do with Fedora's own changes on grub.
After all the grub-btrfs file is no special file, it is just a grub menu.

Or do you see it as grub-btrfs specific yourself?

> Or do you see it as grub-btrfs specific yourself?

As things stand, I wonder.
I am not familiar with TPM.
Do I have to sign the grub-btrfs.cfg file?
Is the bug also present on another distribution?

I'm thinking of trying it with ubuntu when I have time.

Well, honestly from my POV: if it were a signing issue it would not work at all, ever. Don't you think so? That would be my understanding at least.
It works well if you only have a few snapshots, there are no issues whatsoever then. No boot problems, no grub problems. If you exceed a certain threshold you suddenly get a TPM error.
This rather looks like a very strange bug to me.

And the cfg file is just integrated into the standard grub.cfg. To me this is only a submenu, nothing more. What would happen if you wrote it directly into grub.cfg? There would be no difference in my eyes.

> Well, honestly from my POV: if it were a signing issue it would not work at all, ever. Don't you think so? That would be my understanding at least.
> It works well if you only have a few snapshots, there are no issues whatsoever then. No boot problems, no grub problems. If you exceed a certain threshold you suddenly get a TPM error.
> This rather looks like a very strange bug to me.

I agree with your thoughts.

> And the cfg file is just integrated into the standard grub.cfg. To me this is only a submenu, nothing more. What would happen if you wrote it directly into grub.cfg? There would be no difference in my eyes.

Exactly, the bug would occur earlier, as the grub.cfg file would be filled with too many entries, preventing us from booting on the standard entry.

I haven't found any other distros that include tpm in Grub natively. Not knowing how Fedora has implemented it, it is difficult to do the same thing on another distro.

> The question is where to address it in the long term. I clearly see it as a grub issue.

I agree, it's a Grub bug.

Just want to say that this bug is still present.

Differences: I'm on Arch instead of Fedora. My boot menu for Grub is approximately 12 items with the only subfolder being snapshots. The rmmod tpm workaround does not work for me.

Similarities: Grub Boots regular images fine. Only snapshots are an issue.

Have there been any developments on this bug?

This bug seems to be present on Fedora 36 but not on EndeavourOS, an arch derivative with pre-built grub.

In endeavour I could keep ftpm on with secure boot disabled no issues. With Fedora, I had to turn ftpm off (I don't really use it for anything so meh), but I can't help but feel like it might be useful in the future. Worth noting that in Endeavour I got as high as 100 snapshots with no issue. With Fedora I was seeing issues well below 20 snapshots. I don't think snapshot count matters.

My workaround when I did have ftpm on was to enable secure boot, boot into Fedora, then disable it again after reboot. Fedora worked just fine from that point on. But it requires rinse and repeat after every snapshot.

This may just be speculation, but I think it's some kind of text/buffer overflow. From what I have reviewed and tested (using Nobara 36), this bug did not occur for me with the default limit of 50 snapshots if I changed the snapshot label to omit type and description in the config file, like so:
current/default installation: GRUB_BTRFS_TITLE_FORMAT=("date" "snapshot" "type" "description")
modification: GRUB_BTRFS_TITLE_FORMAT=("date" "snapshot")
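If the overflow guess is right, the overall size of the generated menu file and the length of its entry lines would matter more than the raw snapshot count. A minimal sketch to measure both before and after changing the title format (the function name and the example path are assumptions):

```shell
# Report the size of a GRUB menu file in bytes and the length of its
# longest menuentry/submenu line, to compare different title formats.
cfg_stats() {
    printf 'bytes=%s\n' "$(wc -c < "$1" | tr -d '[:space:]')"
    awk '
        /^[[:space:]]*(menuentry|submenu)[[:space:]]/ {
            if (length($0) > max) max = length($0)
        }
        END { printf "longest_entry_line=%d\n", max }
    ' "$1"
}

# Example call (the path is an assumption; adjust it to your system):
# cfg_stats /boot/grub2/grub-btrfs.cfg
```

Comparing the numbers from a working and a failing configuration might help narrow down whether a size threshold is involved.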

npv12 commented

> This may just be speculation, but I think it's some kind of text/buffer overflow. From what I have reviewed and tested (using Nobara 36), this bug did not occur for me with the default limit of 50 snapshots if I changed the snapshot label to omit type and description in the config file, like so:
> current/default installation: GRUB_BTRFS_TITLE_FORMAT=("date" "snapshot" "type" "description")
> modification: GRUB_BTRFS_TITLE_FORMAT=("date" "snapshot")

This helped me as well. Seems like when the description is unusually large then I can easily reproduce it failing to boot from snapshot.

> This may just be speculation, but I think it's some kind of text/buffer overflow. From what I have reviewed and tested (using Nobara 36), this bug did not occur for me with the default limit of 50 snapshots if I changed the snapshot label to omit type and description in the config file, like so:
> current/default installation: GRUB_BTRFS_TITLE_FORMAT=("date" "snapshot" "type" "description")
> modification: GRUB_BTRFS_TITLE_FORMAT=("date" "snapshot")

I can confirm: after deleting some of the snapshots and leaving only a couple, the vmlinuz entries started to reappear.

I think this is because the /var directory is read-only and because the snapshot is not created in read-write mode. You could try setting the snapshot as read-write by doing:

$ sudo btrfs property set -ts /.snapshots/1/snapshot ro false

Then try booting into the snapshot. Let us know if it works so others know what the issue is.