monogon-dev/monogon

node: EFI firmware is losing our Bootentries

Closed this issue · 3 comments

This is a tracker issue to find out what the EFI firmware of (Dell|.+)? servers has against us.
TL;DR:

  1. Takeover creates the boot entry Metropolis Slot A
  2. Metropolis installs an update to Slot B and creates the entry Metropolis Slot B
  3. Reboot
  4. Entry B is missing

Speculation:
The Dell firmware seems to scan all ESP partitions for a cache file and add the entries stored there back to efivars.


This is my debugging story on a Dell PowerEdge R6515:

  1. Running in an Alpine-based recovery environment
  2. Validate the current EFI config
localhost:~# efibootmgr
BootCurrent: 0002
BootOrder: 0002,0005,0006
Boot0002* NIC in Slot 3 Port 1 Partition 1
Boot0005* grub
Boot0006* ubuntu
  3. Run the binary that executes the Metropolis update routine for adding boot entries
localhost:~# ./boot 1 e145c06a-ff7a-4e6e-a9a1-e099623068a8
2023/07/29 15:50:54 root                             I0729 15:50:54.332455 main.go:53] Boot entries before: map[2:0xc0000ec5a0 5:0xc0000ec5f0 6:0xc0000ec640]
2023/07/29 15:50:54 root                             I0729 15:50:54.332471 main.go:81] Not our entry: &{Description:NIC in Slot 3 Port 1 Partition 1 Inactive:false Hidden:false Category:0 FilePath:[0xc0000e0600] OptionalData:[]}
2023/07/29 15:50:54 root                             I0729 15:50:54.347381 main.go:120] Adding entry: 10 - &{Description:Metropolis Slot A Inactive:false Hidden:false Category:0 FilePath:[0xc00024a000 /EFI/metropolis/boot-a.efi] OptionalData:[]}, <nil>
2023/07/29 15:50:54 root                             I0729 15:50:54.347390 main.go:81] Not our entry: &{Description:NIC in Slot 3 Port 1 Partition 1 Inactive:false Hidden:false Category:0 FilePath:[0xc0000e0600] OptionalData:[]}
2023/07/29 15:50:54 <nil>
2023/07/29 15:50:54 root                             I0729 15:50:54.362715 main.go:120] Adding entry: 11 - &{Description:Metropolis Slot B Inactive:false Hidden:false Category:0 FilePath:[0xc000128060 /EFI/metropolis/boot-b.efi] OptionalData:[]}, <nil>
2023/07/29 15:50:54 root                             I0729 15:50:54.362724 main.go:64] Boot entries after: map[2:0xc0000ec5a0 5:0xc0000ec5f0 6:0xc0000ec640 10:0xc0000ec820 11:0xc000160140]
  4. Validate the result
localhost:~# efibootmgr
BootCurrent: 0002
BootOrder: 0002,0005,0006
Boot0002* NIC in Slot 3 Port 1 Partition 1
Boot0005* grub
Boot0006* ubuntu
Boot000A* Metropolis Slot A
Boot000B* Metropolis Slot B
  5. Reboot again into recovery

  6. Boot entries are sorted differently and Metropolis Slot B is missing

root@shepherd-prod-646c5438-3247-4d53:~# efibootmgr
BootCurrent: 0006
BootOrder: 0002,0005,0006,0003
Boot0002* NIC in Slot 3 Port 1 Partition 1
Boot0003* Metropolis Slot A
Boot0005* grub
Boot0006* ubuntu
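
Because the firmware renumbers entries across the reboot (Slot A moved from 000A to 0003), comparing by description is more reliable than comparing by number. A minimal Go sketch of such a comparison over plain (non-verbose) efibootmgr output; `parseEntries` and `missingAfter` are hypothetical helper names, not part of Metropolis:

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// bootLine matches efibootmgr lines such as "Boot000A* Metropolis Slot A".
// The "*" marks an active entry and is treated as optional here.
var bootLine = regexp.MustCompile(`(?m)^Boot([0-9A-Fa-f]{4})\*?[ \t]+(.*)$`)

// parseEntries maps each entry description to its Boot#### number (hex digits).
// Entries sharing a description collapse into one key, so this only answers
// "is an entry with this description present at all?".
func parseEntries(output string) map[string]string {
	entries := make(map[string]string)
	for _, m := range bootLine.FindAllStringSubmatch(output, -1) {
		entries[m[2]] = m[1]
	}
	return entries
}

// missingAfter returns the descriptions present in before but absent in after.
func missingAfter(before, after map[string]string) []string {
	var missing []string
	for desc := range before {
		if _, ok := after[desc]; !ok {
			missing = append(missing, desc)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	before := parseEntries(`BootCurrent: 0002
BootOrder: 0002,0005,0006
Boot0002* NIC in Slot 3 Port 1 Partition 1
Boot0005* grub
Boot0006* ubuntu
Boot000A* Metropolis Slot A
Boot000B* Metropolis Slot B
`)
	after := parseEntries(`BootCurrent: 0006
BootOrder: 0002,0005,0006,0003
Boot0002* NIC in Slot 3 Port 1 Partition 1
Boot0003* Metropolis Slot A
Boot0005* grub
Boot0006* ubuntu
`)
	fmt.Println("missing after reboot:", missingAfter(before, after))
	fmt.Println("Slot A is now Boot" + after["Metropolis Slot A"])
}
```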

@lorenz suspected that the firmware setting "UEFI Variable Access" was set to protected, but that is sadly not the case.

[Screenshot: "UEFI Variable Access" firmware setting]


We aren't writing PartitionStartBlock and PartitionSizeBlocks in the EFI entries, as they are optional. It turns out that the firmware adds them itself?! (This is speculation, as there is too much weird behaviour.)
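
For reference, these two fields live in the Hard Drive media device-path node of a boot entry. A minimal Go sketch of that node's layout as I understand it from the UEFI specification; `guidBytes` and `hardDriveNode` are names made up for illustration and this is not the Metropolis implementation:

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"strings"
)

// guidBytes converts a textual GUID into the mixed-endian byte order used by
// UEFI: the first three fields are little-endian, the rest is stored as written.
func guidBytes(s string) ([16]byte, error) {
	var out [16]byte
	raw, err := hex.DecodeString(strings.ReplaceAll(s, "-", ""))
	if err != nil {
		return out, err
	}
	if len(raw) != 16 {
		return out, fmt.Errorf("GUID %q has %d bytes, want 16", s, len(raw))
	}
	out[0], out[1], out[2], out[3] = raw[3], raw[2], raw[1], raw[0]
	out[4], out[5] = raw[5], raw[4]
	out[6], out[7] = raw[7], raw[6]
	copy(out[8:], raw[8:])
	return out, nil
}

// hardDriveNode encodes a Hard Drive media device-path node (type 0x04,
// subtype 0x01, fixed length 42 bytes) for a GPT partition.
func hardDriveNode(partNum uint32, startBlock, sizeBlocks uint64, guid [16]byte) []byte {
	node := make([]byte, 42)
	node[0] = 0x04 // type: media device path
	node[1] = 0x01 // subtype: hard drive
	binary.LittleEndian.PutUint16(node[2:4], 42)
	binary.LittleEndian.PutUint32(node[4:8], partNum)
	binary.LittleEndian.PutUint64(node[8:16], startBlock)  // PartitionStartBlock
	binary.LittleEndian.PutUint64(node[16:24], sizeBlocks) // PartitionSizeBlocks
	copy(node[24:40], guid[:])
	node[40] = 0x02 // partition format: GPT
	node[41] = 0x02 // signature type: GUID
	return node
}

func main() {
	guid, err := guidBytes("e145c06a-ff7a-4e6e-a9a1-e099623068a8")
	if err != nil {
		panic(err)
	}
	// Start and size left at zero, as in the entries we wrote.
	fmt.Println(hex.EncodeToString(hardDriveNode(1, 0, 0, guid)))
	// Start and size filled in with the values the firmware-created
	// Boot0003 entry in this issue reports (0x800, 0xc0000).
	fmt.Println(hex.EncodeToString(hardDriveNode(1, 0x800, 0xc0000, guid)))
}
```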

Again, just running the routine to ensure our entries exist:

2023/07/30 23:12:00 root                             I0730 23:12:00.408783 main.go:53] Boot entries before: map[0:0xc0000ec5a0 1:0xc0000ec5f0 2:0xc0000ec640 3:0xc0000ec690 5:0xc0000ec6e0 6:0xc0000ec730 10:0xc0000ec780]
2023/07/30 23:12:00 root                             I0730 23:12:00.408798 main.go:81] Not our entry: &{Description:NIC in Slot 3 Port 1 Partition 1 Inactive:false Hidden:false Category:0 FilePath:[0xc0000e06c0] OptionalData:[]}
2023/07/30 23:12:00 root                             I0730 23:12:00.408805 main.go:100] Found entry: &{Description:Metropolis Slot A Inactive:false Hidden:false Category:0 FilePath:[0xc000011fb0 /EFI/metropolis/boot-a.efi] OptionalData:[]}
2023/07/30 23:12:00 root                             I0730 23:12:00.408809 main.go:81] Not our entry: &{Description:NIC in Slot 3 Port 1 Partition 1 Inactive:false Hidden:false Category:0 FilePath:[0xc0000e06c0] OptionalData:[]}
2023/07/30 23:12:00 root                             I0730 23:12:00.408813 main.go:100] Found entry: &{Description:Metropolis Slot B Inactive:false Hidden:false Category:0 FilePath:[0xc000248060 /EFI/metropolis/boot-b.efi] OptionalData:[]}
2023/07/30 23:12:00 root                             I0730 23:12:00.408818 main.go:64] Boot entries after: map[0:0xc0000ec5a0 1:0xc0000ec5f0 2:0xc0000ec640 3:0xc0000ec690 5:0xc0000ec6e0 6:0xc0000ec730 10:0xc0000ec780]

Suddenly there are two Slot A entries and two grub entries.

Boot0000* Metropolis Slot A    HD(1,GPT,e145c06a-ff7a-4e6e-a9a1-e099623068a8,0x0,0x0)/File(\EFI\metropolis\boot-a.efi)
Boot0001* grub    HD(1,GPT,14cee537-9dc4-4ecf-920c-390b78483678,0x800,0x100000)/File(\EFI\grub\grubx64.efi)
Boot0002* NIC in Slot 3 Port 1 Partition 1    VenHw(3a191845-5f86-4e78-8fce-c4cff59f9daa)
Boot0003* Metropolis Slot A    HD(1,GPT,e145c06a-ff7a-4e6e-a9a1-e099623068a8,0x800,0xc0000)/File(\EFI\metropolis\boot-a.efi)
Boot0005* grub    HD(1,GPT,14cee537-9dc4-4ecf-920c-390b78483678,0x800,0x100000)/File(\EFI\grub\shimx64.efi)
Boot0006* ubuntu    HD(1,GPT,14cee537-9dc4-4ecf-920c-390b78483678,0x800,0x100000)/File(\EFI\ubuntu\shimx64.efi)
Boot000A* Metropolis Slot B    HD(1,GPT,e145c06a-ff7a-4e6e-a9a1-e099623068a8,0x0,0x0)/File(\EFI\metropolis\boot-b.efi)

After this happened, it was impossible to delete entries with either efibootmgr or the firmware configuration screen.

After deleting the ESP partition on sda (from the default Ubuntu installation), it was possible to delete its boot entries. See the speculation under TL;DR.

We also tested whether adding PartitionStartBlock and PartitionSizeBlocks to the entries makes them survive, but found no change in behavior.


@fionera remembers that there is a Dell folder in our ESP, which Dell apparently uses to cache things.
Deleting it doesn't help.

localhost:~# xxd /mnt/EFI/Dell/BootOptionCache/BootOptionCache.dat
00000000: 0100 0100 e6c5 4843 9200 0000 0142 006f  ......HC.....B.o
00000010: 006f 0074 0030 0030 0030 0030 0000 0001  .o.t.0.0.0.0....
00000020: 0000 0068 004d 0065 0074 0072 006f 0070  ...h.M.e.t.r.o.p
00000030: 006f 006c 0069 0073 0020 0053 006c 006f  .o.l.i.s. .S.l.o
00000040: 0074 0020 0042 0000 0004 012a 0001 0000  .t. .B.....*....
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: 006a c045 e17a ff6e 4ea9 a1e0 9962 3068  .j.E.z.nN....b0h
00000070: a802 0204 043a 005c 0045 0046 0049 005c  .....:.\.E.F.I.\
00000080: 006d 0065 0074 0072 006f 0070 006f 006c  .m.e.t.r.o.p.o.l
00000090: 0069 0073 005c 0062 006f 006f 0074 002d  .i.s.\.b.o.o.t.-
000000a0: 0062 002e 0065 0066 0069 0000 007f ff04  .b...e.f.i......
000000b0: 0092 0000 0001 4200 6f00 6f00 7400 3000  ......B.o.o.t.0.
000000c0: 3000 3000 3100 0000 0100 0000 6800 4d00  0.0.1.......h.M.
000000d0: 6500 7400 7200 6f00 7000 6f00 6c00 6900  e.t.r.o.p.o.l.i.
000000e0: 7300 2000 5300 6c00 6f00 7400 2000 4100  s. .S.l.o.t. .A.
000000f0: 0000 0401 2a00 0100 0000 0000 0000 0000  ....*...........
00000100: 0000 0000 0000 0000 0000 6ac0 45e1 7aff  ..........j.E.z.
00000110: 6e4e a9a1 e099 6230 68a8 0202 0404 3a00  nN....b0h.....:.
00000120: 5c00 4500 4600 4900 5c00 6d00 6500 7400  \.E.F.I.\.m.e.t.
00000130: 7200 6f00 7000 6f00 6c00 6900 7300 5c00  r.o.p.o.l.i.s.\.
00000140: 6200 6f00 6f00 7400 2d00 6100 2e00 6500  b.o.o.t.-.a...e.
00000150: 6600 6900 0000 7fff 0400                 f.i.......
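
The bytes that follow each `Boot000x` name in this dump appear to follow the EFI_LOAD_OPTION layout from the UEFI specification (attributes, file path list length, UTF-16 description, device-path nodes). A rough Go sketch that decodes the second record under that assumption; the cache file's own framing is not interpreted, and all type and function names are hypothetical:

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"errors"
	"fmt"
	"unicode/utf16"
)

// PathNode is one raw UEFI device-path node.
type PathNode struct {
	Type, SubType uint8
	Data          []byte // payload after the 4-byte node header
}

// LoadOption holds the decoded fields of an EFI_LOAD_OPTION.
type LoadOption struct {
	Attributes  uint32
	Description string
	Nodes       []PathNode
}

// utf16z decodes a NUL-terminated UTF-16LE string and reports how many
// bytes (terminator included) it consumed.
func utf16z(b []byte) (string, int, error) {
	var units []uint16
	for pos := 0; pos+2 <= len(b); pos += 2 {
		u := binary.LittleEndian.Uint16(b[pos:])
		if u == 0 {
			return string(utf16.Decode(units)), pos + 2, nil
		}
		units = append(units, u)
	}
	return "", 0, errors.New("unterminated UTF-16 string")
}

// parseLoadOption decodes Attributes, FilePathListLength, Description and the
// device-path nodes. Any trailing OptionalData is ignored.
func parseLoadOption(b []byte) (*LoadOption, error) {
	if len(b) < 6 {
		return nil, errors.New("load option too short")
	}
	opt := &LoadOption{Attributes: binary.LittleEndian.Uint32(b[0:4])}
	pathLen := int(binary.LittleEndian.Uint16(b[4:6]))
	desc, n, err := utf16z(b[6:])
	if err != nil {
		return nil, err
	}
	opt.Description = desc
	start := 6 + n
	if start+pathLen > len(b) {
		return nil, errors.New("file path list exceeds buffer")
	}
	path := b[start : start+pathLen]
	for len(path) >= 4 {
		size := int(binary.LittleEndian.Uint16(path[2:4]))
		if size < 4 || size > len(path) {
			return nil, errors.New("bad device-path node length")
		}
		opt.Nodes = append(opt.Nodes, PathNode{Type: path[0], SubType: path[1], Data: path[4:size]})
		path = path[size:]
	}
	return opt, nil
}

// sampleOption returns the 146 bytes after the "Boot0001" name in the dump above.
func sampleOption() []byte {
	b, err := hex.DecodeString("0100000068004d00" +
		"6500740072006f0070006f006c006900" + "7300200053006c006f00740020004100" +
		"000004012a0001000000000000000000" + "000000000000000000006ac045e17aff" +
		"6e4ea9a1e099623068a8020204043a00" + "5c004500460049005c006d0065007400" +
		"72006f0070006f006c00690073005c00" + "62006f006f0074002d0061002e006500" +
		"6600690000007fff0400")
	if err != nil {
		panic(err)
	}
	return b
}

func main() {
	opt, err := parseLoadOption(sampleOption())
	if err != nil {
		panic(err)
	}
	fmt.Printf("attributes=%#x description=%q\n", opt.Attributes, opt.Description)
	for _, node := range opt.Nodes {
		fmt.Printf("node type=%#02x subtype=%#02x payload=%d bytes\n", node.Type, node.SubType, len(node.Data))
	}
}
```

Read this way, the cached hard-drive node carries zero for both the partition start and size, matching the entries as we originally wrote them.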
  1. Installing MonogonOS from scratch will add the partitions and entries. efibootmgr shows the entries.
  2. reboot
  3. efibootmgr is missing the entries (the Dell cache file contains them, though)
  4. reboot
  5. efibootmgr now shows the entries

Sadly, this is not the case for Slot B, where even after multiple attempts and reboots the entries are not available.

Slot B had a bug (will be fixed with https://review.monogon.dev/c/monogon/+/2031), and the Dell firmware deletes boot entries that aren't in the boot order. That is probably why debugging was so flaky.
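
To spot entries at risk of being dropped this way, one can compare the existing Boot#### numbers against the BootOrder variable. A small Go sketch, assuming the efivarfs file layout (a 4-byte attribute word followed by little-endian uint16 entry numbers); `parseBootOrder` and `notInOrder` are illustrative names, and the sample data mirrors the first efibootmgr output in this issue rather than a live system:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// parseBootOrder decodes the contents of an efivarfs BootOrder file, e.g.
// /sys/firmware/efi/efivars/BootOrder-8be4df61-93ca-11d2-aa0d-00e098032b8c:
// a 4-byte attribute word followed by little-endian uint16 entry numbers.
func parseBootOrder(raw []byte) ([]uint16, error) {
	if len(raw) < 4 || (len(raw)-4)%2 != 0 {
		return nil, errors.New("malformed BootOrder variable")
	}
	data := raw[4:]
	order := make([]uint16, 0, len(data)/2)
	for i := 0; i < len(data); i += 2 {
		order = append(order, binary.LittleEndian.Uint16(data[i:]))
	}
	return order, nil
}

// notInOrder returns the entries that BootOrder does not reference,
// in the order they were passed in.
func notInOrder(order, entries []uint16) []uint16 {
	listed := make(map[uint16]bool, len(order))
	for _, n := range order {
		listed[n] = true
	}
	var orphans []uint16
	for _, n := range entries {
		if !listed[n] {
			orphans = append(orphans, n)
		}
	}
	return orphans
}

func main() {
	// Attribute word 0x7, then BootOrder 0002,0005,0006.
	raw := []byte{0x07, 0, 0, 0, 0x02, 0, 0x05, 0, 0x06, 0}
	order, err := parseBootOrder(raw)
	if err != nil {
		panic(err)
	}
	// Entries present after adding Slot A (000A) and Slot B (000B).
	for _, n := range notInOrder(order, []uint16{0x2, 0x5, 0x6, 0xA, 0xB}) {
		fmt.Printf("Boot%04X is not in BootOrder\n", n)
	}
}
```

Whether a given firmware actually prunes such entries is the observation from this issue, not something the sketch can verify.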

Closing for now, as we migrated away from boot entries for A/B: #263