Firmware handling of upstream kernel and Device Tree files
Closed this issue ยท 21 comments
Although most users are unaware of it, the RPi firmware has special support for "upstream" (compiled from source code found on kernel.org) Linux kernels and Device Tree files. Specifically, the upstream_kernel=1
flag can be used to request that the relevant upstream DTB files are loaded, falling back to the downstream variants as necessary and applying the upstream
overlay to help bridge the gap between the two worlds.
This mechanism has relied on the fact that the upstream developers have used a different naming convention for their DT files, basing them on the package name (BCM283x
) as opposed to the die name (BCM27xx
) (*). The Pi 4B SoC chip is called BCM2711 - there is no BCM2838, although the name is guaranteed free for use - and against expectations the upstream devs have chosen to also use bcm2711
as their SoC identifier. This will break the select-by-filename logic used up to now, so an alternative is needed.
Another little-known firmware feature is the ability to request that overlays are loaded from a different subdirectory of the boot partition (on SD card or network share). This is controlled by the overlay_prefix
setting, the default for which is overlays/
.
A third item for consideration is that it would be useful to be able to store multiple independent operating systems on a single image and select between them at boot time based on a config.txt setting. This would allow (for example) a "recovery" OS for the case when a bad kernel build prevents the device from booting (almost a daily occurrence for me, sometimes). I've recently implemented a physical "64-bit switch" on my daily driver Pi4 that pulls GPIO5 to ground, activating the [gpio5=0]
section that sets arm_64bit=1
. It would be nice to be able to extend that to multiple alternate builds of the same kernel type, etc.
Pulling these strands together brings me to suggest a new config.txt setting - os_prefix
. The default value would be the empty string, but if set it would be prepended to the names of "OS" files. Booting with os_prefix=backup-
might load backup-kernel.img
, whereas os_prefix=backup/
would cause the firmware to look in the backup
directory.
What constitutes an "OS" file? The kernel, .dtbs and cmdline.txt definitely fit the description, and I'm declaring that the firmware files (bootcode.bin
, start*.elf
, fixup*.dat
) don't. Overlays fall into a grey area - making them OS-specific is conceptually cleanest and most flexible, making them common saves a small amount of space. I'm leaning towards a hybrid scheme whereby the firmware looks for ${os_prefix}${overlay_prefix}README
(**) and, if found, sets overlay_prefix
to ${os_prefix}${overlay_prefix}
, otherwise leaving it unchanged. This allows shared and unshared overlays, but prevents a pick-and-mix approach.
The proposal for the handling of upstream files is that setting upstream_kernel=1
has an implicit side-effect of setting os_prefix=upstream/
(unless os_prefix
is explicitly set). Putting all upstream kernel files into a subdirectory allows upstream and downstream to coexist, regardless of the names of the individual files.
N.B. os_prefix
and upstream_kernel
only affect automatic file selection - they have no effect on explicit cmdline=
, kernel=
, device_tree=
and ramfsfile=
settings which are always relative to the root of the boot partition (or the network equivalent).
Does anybody have any improvements to suggest or concerns about this approach?
(*) This could be the wrong way round - all that matters is that that each chip effectively has two names.
(**) The network and USB boot mechanisms don't have a way to test for the existence of a directory, only a file (and in the case of a non-directory prefix it wouldn't make sense anyway) so test for the existence of the README instead.
@MilhouseVH Am I right in thinking that LibreELEC uses a custom overlay_prefix
? Does this scheme work for you?
No concerns from me and I think keeping upstream kernels, device trees etc in a separate directory is a good idea.
That's also the setup I chose quite a while ago when testing upstream kernels on a Raspbian SD card
Thanks for making a suggestion to handle these conflicts. The final goal would be use one DTB for up- and downstream kernel. Without the dwc_otg / dwc2 conflict on the BCM2711 this is much easier to achieve.
Until we reach this point, this sounds like a great improvement.
Btw i plan to submit the first PR for 4.19 to "push" the downstream kernel into the upstream direction. This would contain mostly the upstream DT changes until Linux 5.4.
From the distro perspective (at least talking for openSUSE) we are booting U-Boot, which uses distro_boot. Latter looks for broadcom/bcm2711-rpi-4-b.dtb on the FAT partition. If present it uses this to boot the kernel, otherwise it falls back to the DTB provided by the FW.
So I don't have a need for upstream_kernel=1
. My understanding is that os_prefix
gets ignored for the kernel if you provide kernel=u-boot.bin
explicitly. That's the only thing that could break the way we use config.txt
for now.
In general, at openSUSE we use FW -> U-Boot -> Grub boot path, so we handle different kernels through the Grub menu.
My understanding is that os_prefix gets ignored for the kernel if you provide kernel=u-boot.bin explicitly.
That's correct - I've amended the original post to make that clear.
The promised PR #3244 is available now.
After some internal discussion I changed the file location logic to make it simpler (i.e. more consistent) and more flexible. In the new scheme, os_prefix
is applied to all OS files (kernels, cmdline.txt, dtbs, initramfs). It is also applied to overlay_prefix
, provided ${os_prefix}${overlay_prefix}README
exists.
If upstream_kernel=1
, the firmware sets os_prefix=upstream/
provided os_prefix hasn't already been set and it can locate the kernel or the dtb there (remember that we can't check for the existence of a directory).
To allow the OS prefix to be skipped, all user-specified filenames may include a leading /
character, making the paths absolute w.r.t. the boot directory/FAT partition.
This revised logic is in the most recent firmware release. Please report any issues you find.
While not quite the original purpose of this issue, I noticed that this lays a lot of the ground work for an A/B rootfs. I am currently using Mender and it requires u-boot to handle booting to the different root partitions / kernels. It looks like with a couple additions to the firmware, u-boot might not be necessary. Is this something you'd be willing to incorporate into the firmware? I know one important feature currently missing is a boot count. At first glance, the rest of the features could be handled through config.txt or possibly a new special file with setting specific to this sort of setup. Is something that might be possible or am I just dreaming?
The main problem with implementing some kind of boot count is the extremely limited storage that survives a reset - we have 6 bits (yes, bits) to play with. Currently 0x3f means HALT, otherwise it is a partition number for the boot partition. 63 is a huge number of partitions, so there is some scope for, errm... partitioning it, but I think two bits would be the limit.
Please outline how you think such an A/B rootfs ought to work, and we'll see whether it sounds feasible without a huge amount of effort.
Hi @pelwell! Sorry for the delayed response!
I think two bits would work perfectly for the boot count.
The general idea with the A/B setup is to provide atomic image updates than can be rolled back if unsuccessful. New images are written to the secondary root partition and then re-booted into. If successful, it becomes the primary partition. Wash, rinse, repeat.
I've been discussing this with mirzak over on the Mender forum and I think we've got it mostly sorted out. I've done my best to keep this generic. The neat thing is that this would essentially be dual function and would add a nice (optional) recovery feature for people not utilizing an A/B setup. I'll do my best to outline the proposed additions below.
- A 2-bit counter (
bootcount
) that would persist across power cycles. - A userspace tool that would allow
bootcount
to be read and reset. - 3 new optional flags for config.txt
upgrade_available
Signals that a new image has been installed and should be booted. Unused in non-A/B scenarios. Defaults to 0 if not specified.bootlimit
Sets the threshold for failed boots (set to 1 for A/B setup, 4 if unspecified.)recovery_os_prefix
Same idea asos_prefix
, but for the "recovery" os.
- New boot logic approximated below.
if(bootcount<bootlimit && upgrade_available==0)
//Normal boot
boot();
elseif(bootcount<bootlimit && upgrade_available==1)
//Upgrade is pending, boot to it
os_prefix=recovery_os_prefix;
boot();
elseif(bootcount>=bootlimit && upgrade_available==1)
//Upgrade failed, boot normal os
boot();
elseif(bootcount>=bootlimit && upgrade_available==0)
//Normal boot failed, fall back to recovery
/*Note: This may not be the desired behavior in all situations.
An additional flag was proposed to select this behavior, but it
could also be handled by modifying the recovery_os_prefix or
otherwise disabling the "recovery os" in userspace. This would
keep the code here simple and stop forced "insecure" boots. */
boot();
Root partitions would be specified in separate cmdline.txt's for each os_prefix
. Userspace tools would handle updating the os_prefix
and upgrade_available
flags before/after updates.
Does this sound like something that might be feasible to incorporate? I'd be thrilled to get rid of my custom version of u-boot in favor of the stock bootloader. If I failed to explain something satisfactorily, just let me know and I'll give it another go. Also, if you'd rather this was submitted as it's own issue, I'm happy to do that as well.
Cheers!
PS
It just occurred to me that rather than having the boot logic in the FW, an alternative might be sections in config.txt like [bootcount=1] or similar. I'm not sure if that's feasible or even a good idea, but I thought I'd mention it as it popped into by head as I was about to hit submit.
Just checking in to see if there was any movement on this, one way or the other. Thanks!
Your reply came in at a busy time, and not being at all what I expected it was put to one side for later review. And now a month later I'm reminded why I couldn't just say "yes, sure!".
My original answer was based on the idea of using another few bits of the very scarce reset-proof state, but that doesn't survive a power cycle so could only be used with a user-accessible reset signal (removing the need to pull the power cable in the event of a boot failure). The current firmware never writes to persistent storage, and making it do so is a big step that we are reluctant to take. A boot count, by definition, relies on changing state at every boot (twice, in the event of a successful boot). One can think of journal-like schemes where a pool of sectors is used to spread the load of maintaining the count, but it's still write operations, and that's a road we don't want to go down.
The current firmware never writes to persistent storage [...]
The recovery.bin
from the eeprom firmware upgrade process renames itself, so at least in theory it seems possible. Maybe keeping the state within a magical filename itself is easier than writing to their content, as that requires fewer indirection and checks? That said: I totally understand and agree with the reluctance to add any more state changing code to the firmware as the seems fragile and probably adds a ton of new error paths and A/B booting is already possible without these changes.
Ah - recovery.bin is a special case, and doesn't really form part of what I think of as the firmware (start*.elf), although clearly it is. You are right that such a change is possible, but it's not feasibility that's preventing us proceeding - see Ian Malcolm's views on could and should.
ping @agherzan I've just stumbled across this issue and it sounds like it might be of interest to BalenaOS too? ๐
@pelwell A bit of a pie-in-the-sky question, but would there ever be any config.txt
settings that somebody might want to set to different values for different OSes? If it doesn't make the logic too complicated, would it be worth adding config.txt filter-sections based on the current value of os_prefix
? ๐คทโโ๏ธ (I'm not an OS developer, so this isn't something I'm looking for myself, I'm just playing devil's advocate ๐ )
@pelwell, apologies for the extremely delayed reply. While certainly not the news I was hoping for, I completely understand the reluctance to add writing to persistent storage to the firmware. I guess it's back to uboot for me... Thanks again for taking the time to look it over!
@pelwell Does the lovely documentation in raspberrypi/documentation#1433 mean that this feature has now been implemented, and this issue should be closed?
Yup.