rocky-linux/rocky-tools

Mirrored EFI boot partitions not handled by migrate2rocky.sh

eborisch opened this issue · 13 comments

Starting here:

efi_mount=$(findmnt --mountpoint /boot/efi --output SOURCE \

for a system with software-mirrored (during install) EFI boot partitions, the migration tool fails.

For example, with (trimmed here) lsblk output of:

nvme1n1                         259:0    0   477G  0 disk
├─nvme1n1p2                     259:3    0   601M  0 part
│ └─md125                         9:125  0   601M  0 raid1 /boot/efi
nvme0n1                         259:1    0   477G  0 disk
├─nvme0n1p2                     259:6    0   601M  0 part
│ └─md125                         9:125  0   601M  0 raid1 /boot/efi

and following along in the script's steps:

# findmnt --mountpoint /boot/efi --output SOURCE --noheadings
/dev/md125
# lsblk -no pkname "/dev/md125"
<empty response>

which sets efi_disk empty.

When findmnt returns /dev/mdXXX, something along the lines of

# mdadm -v --detail --scan --export /dev/md125 | awk -F = '/DEVICE.*_DEV/{print $2}'
/dev/nvme0n1p2
/dev/nvme1n1p2
# lsblk -no pkname -l /dev/nvme0n1p2
nvme0n1p2
nvme0n1
# cat /sys/block/nvme0n1/nvme0n1p2/partition
2

(looping over the mdadm outputs) will be needed to collect the list of disks and partition numbers that need to be added as EFI boot entries. (Here efibootmgr -c -d /dev/nvme0n1 -p 2 ..., and again for nvme1n1.)
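
Roughly, something like this sketch (the echo just stands in for whatever arguments the real efibootmgr call would need; device names follow the example above):

```
# Sketch only: enumerate the md members, then derive the disk and partition
# number needed for each efibootmgr -c call.
md_dev=$(findmnt --mountpoint /boot/efi --output SOURCE --noheadings)
mdadm -v --detail --scan --export "$md_dev" |
    awk -F = '/DEVICE.*_DEV/{print $2}' |
    while read -r part; do
        disk=$(lsblk -dno pkname "$part")                        # e.g. nvme0n1
        partnum=$(cat "/sys/class/block/${part##*/}/partition")  # e.g. 2
        echo "would run: efibootmgr -c -d /dev/$disk -p $partnum ..."
    done
```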

Do I understand it correctly that you have your EFI system partition stored on a RAID block device managed by md?

How exactly is the EFI firmware (= the BIOS of your mainboard) able to boot from that? To my knowledge, the EFI specification defines the EFI system partition as FAT, and support for RAID devices below the FAT isn't specified.

Or do you have a special RAID controller that supplies an EFI firmware blob to the EFI firmware on your mainboard to allow booting from the RAID device?

Your firmware must have some kind of knowledge of and support for this setup for reliable operation, as the firmware or other OSes are allowed and expected to make their own modifications to the EFI system partition. Without knowledge of the RAID setup, those modifications would just end up on one disk and screw up the whole RAID.

Agreed, there are caveats. The Rocky installer happily lets you create it, and systems (the Supermicro hardware I've tried it on, and VirtualBox in EFI mode, too) happily boot off it.

Absolutely, if there are other OSes touching the partition it is a bad idea. Many (most?) systems will not be in a multi-boot environment.

Here's what efibootmgr -v shows:

Boot0005* Rocky Linux	HD(2,GPT,<uuid>,0x201000,0x12c800)/File(\EFI\ROCKY\GRUBX64.EFI)
Boot0006* Rocky Linux	HD(2,GPT,<uuid>,0x201000,0x12c800)/File(\EFI\ROCKY\GRUBX64.EFI)

and mdadm:

# mdadm --detail /dev/md125
/dev/md125:
           Version : 1.0
     Creation Time : Fri Sep 11 09:41:47 2020
        Raid Level : raid1
        Array Size : 615360 (600.94 MiB 630.13 MB)
     Used Dev Size : 615360 (600.94 MiB 630.13 MB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Tue Jun 29 12:17:16 2021
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : <host>:boot_efi  (local to host <host>)
              UUID : <uuid>
            Events : 143

    Number   Major   Minor   RaidDevice State
       0     259        6        0      active sync   /dev/nvme0n1p2
       1     259        3        1      active sync   /dev/nvme1n1p2

Note that with metadata version 1.0 the mdadm superblock is at the end of the partition.
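
(If you want to double-check that on a member device, something like the following should show it; field names vary a little between mdadm versions:)

```
# mdadm --examine /dev/nvme0n1p2 | grep -E 'Version|Super Offset'
```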

I 100% agree that if you're using other OSes on the same drives, this is a bad idea; if it is a single-OS system and you want "one of my boot drives failed, please still boot", it works.

From a quick test on VirtualBox with the Rocky installer:

[Screenshots: installer partition setup and the resulting partition actions]

Do I understand it correctly that you have your EFI system partition stored on a RAID block device managed by md?

It's not such an unusual setup, though it is largely being replaced in favor of straight up LVM lately.

How exactly is the EFI firmware (= the BIOS of your mainboard) able to boot from that? To my knowledge, the EFI specification defines the EFI system partition as FAT, and support for RAID devices below the FAT isn't specified.

The EFI firmware doesn't see the RAID; since it's RAID1, all it sees is the ESP on the first RAID disk, and it boots from that as if it were a single FAT partition with no RAID involved. After the initial boot sequence the OS assembles and mounts it as RAID1, so reads and writes use and maintain the partition on both disks. In theory (and, if done properly, in practice) this allows the boot disk to be switched from one to the other in case of disk failure: the system can boot right up, and the RAID can then be rebuilt with a new disk on the running system.
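
For reference, a mirrored ESP like this can be set up by hand with something along these lines (a sketch of roughly what the installer arranges; device names follow the example above). Metadata 1.0 is what keeps the md superblock at the end of the partition, so the firmware just sees a FAT filesystem starting at the usual place:

```
# Sketch: mirror the two ESP partitions with the superblock at the end
# (metadata 1.0), then put a FAT filesystem on the md device.
mdadm --create /dev/md125 --level=1 --raid-devices=2 --metadata=1.0 \
    /dev/nvme0n1p2 /dev/nvme1n1p2
mkfs.vfat -F 32 /dev/md125
```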

for a system with software-mirrored (during install) EFI boot partitions, the migration tool fails.

Thanks for pointing this out. It is something of an edge case nowadays, but I think we can resolve it.

# findmnt --mountpoint /boot/efi --output SOURCE --noheadings
/dev/md125
# lsblk -no pkname "/dev/md125"

Unfortunately I don't have an mdraid system to test this on, but I do have a system with LVM partitions, and resolving the parent of an LVM volume appears to be the same problem. I think it will end up using the same logic as the mdraid case, so for now I'll use that to research a solution and then get you to test it on your mdraid setup, if that's okay.

mdadm -v --detail --scan --export /dev/md125 | awk -F = '/DEVICE.*_DEV/{print $2}'

...

(looping over the mdadm outputs) will be needed to collect the list of disks and partition numbers that need to be added as EFI boot entries. (Here efibootmgr -c -d /dev/nvme0n1 -p 2 ..., and again for nvme1n1.)

Unfortunately we can't rely on mdadm being available on the target system when migrate2rocky is run, but I think I have another solution which will work (testing with LVM because that's what I have available):

[root@CentOS8 ~]# lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sr0          11:0    1 1024M  0 rom  
vda         252:0    0   20G  0 disk 
├─vda1      252:1    0    1G  0 part /boot
└─vda2      252:2    0   19G  0 part 
  ├─cl-root 253:0    0   17G  0 lvm  /
  └─cl-swap 253:1    0    2G  0 lvm  [SWAP]
[root@CentOS8 ~]# lsblk -no pkname,name,kname,type,subsystems /dev/cl/root
       cl-root dm-0  lvm  block
[root@CentOS8 ~]# cd /sys/block/dm-0/slaves/
[root@CentOS8 slaves]# echo *
vda2
[root@CentOS8 slaves]# cat vda2/partition
2
[root@CentOS8 slaves]# lsblk -no pkname /dev/vda2
vda
vda2
vda2

...so, given the above, we can get the kname (dm-0 in this case), use that to look up the slaves (just vda2 here, but likely multiple slaves in your case), and then look up the disk and partition number for each one via lsblk again. Then we fix EFI on all the slaves. Can you have a look at your system and see if you think this approach would work?

For that last command, it looks like it's better as:

[root@CentOS8 slaves]# lsblk -dno pkname /dev/vda2
vda
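
So a rough sketch of the whole lookup might be (names like efi_disk and efi_partition here are just illustrative, not necessarily what migrate2rocky will end up using):

```
# Resolve the device backing /boot/efi, then collect one (disk, partition
# number) pair per underlying member so each can get its own EFI boot entry.
efi_mount=$(findmnt --mountpoint /boot/efi --output SOURCE --noheadings)
kname=$(lsblk -dno kname "$efi_mount")

declare -a efi_disk efi_partition
if [ -d "/sys/block/$kname/slaves" ]; then
    # md / LVM / other device-mapper device: walk every slave.
    for slave in "/sys/block/$kname/slaves"/*; do
        slave=${slave##*/}
        efi_disk+=("$(lsblk -dno pkname "/dev/$slave")")
        efi_partition+=("$(cat "/sys/class/block/$slave/partition")")
    done
else
    # Plain partition: one disk, one partition number.
    efi_disk+=("$(lsblk -dno pkname "$efi_mount")")
    efi_partition+=("$(cat "/sys/class/block/$kname/partition")")
fi

# Each efi_disk[i] / efi_partition[i] pair then gets its own
# efibootmgr -c -d /dev/<disk> -p <partnum> ... invocation.
```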

@pajamian I think this is what you're asking:

# findmnt --mountpoint /boot/efi --output SOURCE --noheadings
/dev/md125

# cd /sys/block/md125/slaves

# echo *
nvme0n1p2 nvme1n1p2

# cat nvme0n1p2/partition
2

# lsblk -dno pkname /dev/nvme0n1p2
nvme0n1

The last two results, gathered by looping over the outputs from echo * above, are what's needed to build the appropriate efibootmgr -c -d /dev/nvme0n1 -p 2 ... commands.
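
For this system that would end up being something like the following (label and loader path copied from the efibootmgr -v output above, so treat them as illustrative):

```
efibootmgr -c -d /dev/nvme0n1 -p 2 -L "Rocky Linux" -l '\EFI\ROCKY\GRUBX64.EFI'
efibootmgr -c -d /dev/nvme1n1 -p 2 -L "Rocky Linux" -l '\EFI\ROCKY\GRUBX64.EFI'
```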

@eborisch Thanks, yes that's exactly what I wanted to know. I'll go ahead and implement the change.

Almost, and I've included the fix there. (A plain efi_disk= assignment sets efi_disk[0], so the += appends that follow land in [1] and [2].)
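
A tiny illustration of that pitfall (not code from the script):

```
efi_disk=""              # scalar assignment already creates efi_disk[0]
efi_disk+=("nvme0n1")    # lands in [1]
efi_disk+=("nvme1n1")    # lands in [2]
declare -p efi_disk      # declare -a efi_disk=([0]="" [1]="nvme0n1" [2]="nvme1n1")
```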

Try again with the latest commit.

Commented on #68; works great -- thanks!

Pulled and merged.