RPi-Distro/pi-gen

Buster builds broken on non-arm hosts

XECDesign opened this issue · 47 comments

Opening this as a heads-up to anyone relying on pi-gen.

https://bugs.launchpad.net/qemu/+bug/1805913

Unless this bug is fixed by the time buster goes live, images built through qemu-arm-static are going to be broken in slightly subtle ways. Luckily, qemu devs are pretty good and the issue is likely to be resolved before then.

pixbuf relies on the mime database, which silently fails to update and returns success. The result is that the desktop is rendered without any icons.

Something similar happens with SSL certificates, breaking rpi-update and anything else that wants to use https.

Those are the known ways images break, but any binary that uses readdir() is not going to work.

Internally, we've moved our builds to an arm build server to avoid going through qemu for now.

@XECDesign thanks for the heads-up. And congratulations on releasing the new RPI4 with Buster!

Are you using this same pi-gen repository internally with your arm build servers? In other words is this repository still considered the official reference Raspbian image builder? Are you planning to update pi-gen to build buster images regardless of the problem with non-arm hosts? Thanks!

I've just pushed the commits that we were using internally, but couldn't make public yet.

Not sure how to approach non-arm host builds right now.

Thanks a lot for that! Appreciated!
I guess for now we have to just wait for the qemu issue to get fixed upstream, unfortunately, i.e. patience.

It's looking more like a kernel issue, but discussions I've seen on the mailing list seem to have fizzled out a long time ago without any resolution. Maybe when Buster is more commonly used it will press the issue.

Hopefully it gets more attention now that Buster is about to go live.
One final question in case you know off the top of your head: does this problem only happen when the host is 64-bit, or does it happen on any non-arm host? If the host is non-arm but 32-bit, do you think it would work OK? I can test using a VM if you are not sure.

Sorry, not sure off the top of my head.

No problem, I will quickly set up a VM with 32-bit Debian and check it out. If it works, we have a temporary solution until it's fixed. Will report back. Thanks again!

Hi @XECDesign, I can now confirm that this issue is specific to hosts with 64-bit kernels, no matter if they are arm or not. I made a test VM with vanilla Debian i386 (32-bit kernel) and the generated image works fine on real hardware (tested with an RPI 3B).

To verify this I built a control "broken" Buster image with a 64 bits Debian host using Docker and I did get SSL problems with curl (with GitHub and other websites too), for example:

$ curl -sSL https://github.com
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

Then I built another Buster image using a 32 bits Debian host and the curl command above worked fine on the same hardware and same network almost at the same time.

Aside, I also noticed a minor issue with the Qemu version shipped with Debian Stretch, where the man-db package being installed for Buster in the image triggers many of these errors:

qemu: Unsupported syscall: 383

This is a manifestation of the following bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=891109

Fortunately this is resolved in the current Qemu version shipped with Debian Buster, therefore building with a Debian Buster host will not show any errors. I will send a PR to update the Dockerfile for this.

In summary, I wanted to let you know that this bug is not affecting 32 bits build hosts, no matter if they are arm or not (at least for me). For now you can use a 32 bits build host and pi-gen will generate a working image.

Hope this is useful for future readers!

Thanks for looking into it. Much appreciated.

On a similar note, if you run buster in a chroot on Ubuntu 18.04 you will need to upgrade proot to 5.1.0-1.13 (if you use it) and qemu-user to something newer than 2.11 (3.1 works). This is because buster uses renameat2() and new features of getauxval(). These versions are available in Ubuntu 19.04 but not 18.04 LTS. In particular bootstrapping will not work in proot without both these upgrades because 'mv' command will not work, which completely borks the preinst and postinst scripts.

This bug is an example of the kind of issue you'll see with too old qemu versions:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=923289
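
For anyone who wants to guard a build script against a too-old qemu-user, here is a minimal sketch of a version comparison. The helper names version_gt and check_qemu_version are hypothetical, the 2.11 threshold is simply the one reported above, and the script assumes a sort that supports -V (GNU coreutils or BusyBox; it is not POSIX).

```shell
# version_gt A B: succeeds when version A is strictly newer than version B.
# Hypothetical helper; relies on `sort -V`, which is not part of POSIX.
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# check_qemu_version VERSION: report whether VERSION is newer than 2.11,
# the version reported in this thread as too old for a buster chroot.
# The version string is passed in by hand; how you obtain it depends on
# your system (package manager, --version output, ...).
check_qemu_version() {
    if version_gt "$1" "2.11"; then
        echo "qemu-user $1: newer than 2.11"
    else
        echo "qemu-user $1: too old for a buster chroot"
    fi
}
```

Calling check_qemu_version 3.1 reports the version as new enough, while 2.11 or anything older is flagged. This only compares version strings; it says nothing about whether a given build actually implements renameat2().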

Hi all, we were able to overcome the SSL certs issue only by rehashing the certs with c_rehash.

Specifically:

# Patch Issue with ssl certs as per https://github.com/RPi-Distro/pi-gen/issues/271
on_chroot << EOF
echo Patching certs...
c_rehash /etc/ssl/certs
EOF

Credit to Keith Tate as well :)

As indicated below by @hhromic, it's not a complete solution.

@Chaser that is a workaround for the ca-certificates package only, not a solution to the actual problem described in this issue. I wouldn't advise advertising it as a solution.

The problem affects any program using the readdir() syscall, not just ca-certificates, and the effects vary. The SSL certificates issue is just one manifestation. Another (as the original post indicated) is icons not being rendered correctly on the desktop.

It is unknown/unverified what other packages might be affected. Therefore the safest solution for now is to build using a 32-bit host (be it ARM or non-ARM) as indicated before.

Thanks @hhromic, updated my comment to be clear it's not a solution.

Are there tests that should be done to confirm issues? I have just built an image on an EC2 ARM (64-bit) instance (a1.2xlarge) running Ubuntu 18.04 LTS. I would like to do some sanity checks on it.
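
One possible sanity check that does not need a booted device is to look at the certificate directory inside the built rootfs. This is only a sketch resting on an assumption the thread does not state outright: that a broken image is missing the hashed symlinks (names like 0123abcd.0) which c_rehash creates in /etc/ssl/certs. The helper name count_hash_links is hypothetical.

```shell
# count_hash_links DIR: print how many symlinks in DIR have a name of the
# form <8 hex digits>.<digit>, the naming scheme used by c_rehash.
# Assumption: a broken image has none of these in etc/ssl/certs.
count_hash_links() {
    dir="$1"
    n=0
    for f in "$dir"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$f" ] || continue
        case "${f##*/}" in
            [0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f].[0-9]*)
                n=$((n + 1)) ;;
        esac
    done
    echo "$n"
}
```

Run it on the build host against the rootfs of the stage you care about, for example count_hash_links "$ROOTFS_DIR/etc/ssl/certs", where ROOTFS_DIR is a placeholder for your own path. A result of 0 would suggest the certificate store was not hashed; a non-zero result does not prove that the rest of the image is fine.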

@hhromic @XECDesign - did you attempt to use an i386 container and see the results? I have heard reports that it worked within our team.

@Chaser I tested using a 32-bit Debian kernel inside a VM as explained in my comment here: #271 (comment)
That is not the same as an i386 userland running inside a 64 bits Docker host (which has a 64 bits kernel), if that is what you mean. Nevertheless I didn't try that approach, but I don't think it would work as the problem is related to the kernel and Qemu.

If you try it and you can confirm it works like I described in my comment, then it would be nice to know. Thanks!

@hhromic - Clean execution of today's pi-gen mainline 1143530

As is, it pulls down qemu-user-static amd64:

Get:103 http://cdn-fastly.deb.debian.org/debian buster/main amd64 qemu-user-static amd64 1:3.1+dfsg-8~deb10u1 [21.1 MB]

Changing to FROM i386/debian:buster

Get:103 http://cdn-fastly.deb.debian.org/debian buster/main i386 qemu-user-static i386 1:3.1+dfsg-8~deb10u1 [22.5 MB]

The build completed successfully. Hopefully this helps.

@Chaser be aware that a successful build is not sufficient proof: the absence of errors during the build doesn't imply that the system was built correctly. As explained in the original post, this is a silent bug, so the build succeeds but the built image is broken.

To verify your built image, burn it to an SD Card and boot a real RPI device with it. Then perform the simple test I explained in my comment: #271 (comment)

$ curl -sSL https://github.com

If you get an SSL error message, then it didn't work. If you get HTML content, then it worked.
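
If you want to script this check on the device, a minimal sketch could look like the following. The helper names explain_curl_status and check_https are hypothetical; the only facts relied on are that curl exits with status 0 on success and that status 60 is the certificate verification failure shown in the error output earlier in this thread.

```shell
# explain_curl_status STATUS: turn a curl exit status into a verdict.
# 60 is the "SSL certificate problem" failure quoted earlier in the thread.
explain_curl_status() {
    case "$1" in
        0)  echo "ok: TLS verification succeeded" ;;
        60) echo "broken: certificate verification failed" ;;
        *)  echo "inconclusive: curl exited with status $1" ;;
    esac
}

# check_https [URL]: run the test from this thread and print the verdict.
# The default URL is the one used above; the page body is discarded.
check_https() {
    url="${1:-https://github.com}"
    status=0
    curl -sSL -o /dev/null "$url" || status=$?
    explain_curl_status "$status"
}
```

Run check_https on the booted RPI. Any status other than 0 or 60 (no network, DNS failure, and so on) is reported as inconclusive rather than as proof either way.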

@hhromic - understood, curl works as expected. HTML content received.

@Chaser that is a very interesting result then and would mean that actually just Qemu needs to be 32-bit, not the host kernel. That refinement is indeed way better than using a 32-bit VM.

Can you confirm you were using a 64 bits Docker host for this test?
I will also give it a try myself too to double-check. Appreciate the testing!

@hhromic - was using Codebuild docker image - https://github.com/aws/aws-codebuild-docker-images/blob/master/ubuntu/standard/2.0/Dockerfile

uname -a output:

Linux 046536f42b8e 4.14.123-86.109.amzn1.x86_64 #1 SMP Mon Jun 10 19:44:53 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

@hhromic I usually run my builds in a VM running Ubuntu 18.04. I don't use docker. I can confirm using the 64-bit 18.04 I will get the SSL error message.
I then changed nothing else other than installing the 32-bit qemu-user-static package and rebuilt the image. Once deployed, I was able to get HTML content without any SSL error messages.
All that to say, I also think having the 32-bit qemu is all that's needed. Not sure if there's any other tests that we can do to prove there are no other issues.

@Chaser @rkubes thanks for your input, really appreciated!
I have now tested it myself too and can confirm that it is indeed just Qemu that needs to be 32-bit, not the kernel of the host system.

I used the included Dockerfile from pi-gen and used the i386/debian:buster base image instead to bring 32 bits binaries (including Qemu) as @Chaser suggested. Worked fine on actual RPI.

Not sure if there's any other tests that we can do to prove there are no other issues.

It is not clear at the moment how to test 100% reliably, however the SSL certificates test is a very good indicator as far as I can tell because it provides a tangible control case.

I will send a PR to update the included Dockerfile. @Chaser thanks a lot again for your input, I didn't know there were i386 images for Docker out there, I would have tested that for sure otherwise.

EDIT: @ryanteck might be interested. You don't need to set up a VM or a 32-bit kernel for your host build system; just make sure you are installing the i386 version of Qemu via multiarch.
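
To double-check which flavour of Qemu a build will actually pick up, one option is to read the ELF header of the binary. This is a small sketch: the helper name elf_class is hypothetical, and it relies only on the ELF format itself, where byte 4 of the header (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects.

```shell
# elf_class FILE: print 32 or 64 depending on the ELF class of FILE,
# or "unknown" if FILE does not start with the ELF magic bytes.
elf_class() {
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo unknown
        return 0
    fi
    # EI_CLASS is the fifth byte of the header (offset 4).
    class=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')
    case "$class" in
        1) echo 32 ;;
        2) echo 64 ;;
        *) echo unknown ;;
    esac
}
```

For example, elf_class "$(command -v qemu-arm-static)" should print 32 if the i386 build is the one on your PATH. It does not tell you which binary binfmt_misc will invoke, only what the given file is.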

N1c0o commented

I got a working build using debian 10 i386.
The desktop is fine and the SSL test doesn't produce any errors, so I guess it's OK, no?

@N1c0o Yes, probably fine. Contrary to the issue title, if you read through the thread the issue was identified with using a 64-bit QEMU. Your Debian on i386 architecture would have had a 32-bit QEMU, so there should be no issues.

Maybe my solution is useful for some people: I decided to use Vagrant with Virtualbox because I can contain everything in a fairly portable VM with all dependencies inside (including a proxy cache for the packages). Just create a file named Vagrantfile in the root of the pi-gen repo like this:

$run = <<"SCRIPT"
echo ">>> Generating rpi image ... $@"
export DEBIAN_FRONTEND=noninteractive
export RPIGEN_DIR="${1:-/home/vagrant/rpi-gen}"
export APT_PROXY='http://127.0.0.1:3142' 
# Prepare. Copy the repo to another location to run as root
rsync -a --delete --exclude 'work' --exclude 'deploy' /vagrant/  ${RPIGEN_DIR}/
cd ${RPIGEN_DIR}
# Clean previous builds. Start always from scratch (the proxy helps here!)
sudo umount --recursive work/*/stage*/rootfs/{dev,proc,sys} || true
# Delete old builds
sudo rm -rf work/*
# Build it again
sudo --preserve-env=APT_PROXY ./build.sh
# Copy images back to host
[ -d deploy ] && cp -vR deploy /vagrant/
SCRIPT

Vagrant.configure("2") do |config|
  # All Vagrant configuration is done here. The most common configuration
  # options are documented and commented below. For a complete reference,
  # please see the online documentation at vagrantup.com.  

  config.vm.define :rpigen do |rpigen|
      # Every Vagrant virtual environment requires a box to build off of.
      rpigen.vm.box = "jriguera/rpibuilder-buster-10.0-i386"
      rpigen.vm.provision "shell" do |s|
        s.inline = $run
        s.args = "#{ENV['WORK_DIR']}"
      end
  end
end

and run vagrant up. It will start by downloading the Virtualbox base image (based on Debian Buster i386) and, once that is done, it will run the build.sh script of the repo. When the build finishes, it will put the images in the deploy folder. If the process fails, you can run it again with vagrant provision. vagrant destroy will delete the VM and its contents. The source is here: https://github.com/jriguera/packer-rpibuilder-vagrant so you can customize it and create your own Raspbian builder VM.

It seems it is fixed on master now. Could someone confirm?

I'm not aware of any fixes going in.

My bad, I used a patched Docker image.

So as per @Chaser's comment, one way to produce a working image from a 64-bit host is to use ./build-docker.sh with this modification to the Dockerfile:

diff --git a/Dockerfile b/Dockerfile
index 706a5fb..cf9aac4 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,4 +1,4 @@
-FROM debian:buster
+FROM i386/debian:buster
 
 ENV DEBIAN_FRONTEND noninteractive

It appears to me that the underlying readdir issue won't happen on btrfs. That would imply that the docker-using people here aren't using btrfs as storage driver, which is also plausible as it isn't the default.

Can somebody please confirm image building works for them as well on btrfs?

Any word(s) on this issue, besides the ones not used in polite company? Am I only going to be able to run ./build.sh in a Debian i386 VBox VM, or will this be resolved? It's been a year. I have streamlined things over the past year: I build a VBox Debian Buster Lite i386 2GB/50GB with the 'depends' and other files. I install apt-cacher-ng with settings to save bandwidth/time. So, there is that, when building multiple versions, thanks.
That just limits me to 2 of my lesser VirtualBox machines.
Help!

Can't do much until it's fixed upstream and progress is a little slow there.

The workaround I'm using at home is extracting qemu-arm-static from the i386 deb here https://packages.ubuntu.com/eoan/qemu-user-static

Since this issue has been open for over a year and "progress is a little slow there", would this be a good time to put up a clear step-by-step workaround for this issue? I've tried to reverse engineer the comments in this thread but I always get lost in the process. The last XECDesign suggestion seems promising, but I don't know where to extract those files or how to make sure the build process actually uses them. Is this for docker or for the host build?
Is there another way to produce a working image without having to go through all that pain?

We could make pi-gen download and extract the i386 version of qemu-arm-static instead of copying whatever is on the system.

Seems pretty reasonable. Is it something that can be done quickly?

Just hit this issue while doing something else and the previous workaround didn't work. It looks like at least on some distributions it's no longer necessary to copy the qemu binary into the chroot. In my case, it was using the system's qemu binary rather than the 32bit one I was copying into the chroot.

The workaround I'm using right now is to override the qemu path binfmt uses with a local 32bit copy (edit as appropriate in your case):

if [ "$(dpkg --print-architecture)-$ARCH" = "amd64-armhf" ] && [ ! -e /proc/sys/fs/binfmt_misc/sbuild-arm ]; then
	echo ":sbuild-arm:M::\x7f\x45\x4c\x46\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x28\x00:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff\xff:$(realpath "qemu/qemu-arm-static"):OCF" > /proc/sys/fs/binfmt_misc/register
fi

Then to remove that override:

if [ -e /proc/sys/fs/binfmt_misc/sbuild-arm ]; then
	echo "-1" > /proc/sys/fs/binfmt_misc/sbuild-arm
fi

Edit:
It looks like this is the relevant change that causes the different behaviour: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1815100

So, maybe that override with the F flag is a good approach for pi-gen to use in general, to avoid even having to copy the binary in the first place. Pi-gen would just need to check that multiarch support is enabled and that the dependencies are installed, then fetch the binary from somewhere else - debian or ubuntu repos.
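
To see which interpreter a binfmt_misc entry will actually run, you can read its status file under /proc/sys/fs/binfmt_misc. The sketch below parses the "interpreter" and "flags:" lines of such a file. The helper names are hypothetical, and it assumes the interpreter path contains no spaces.

```shell
# binfmt_interpreter FILE: print the interpreter path recorded in a
# binfmt_misc status file (e.g. /proc/sys/fs/binfmt_misc/sbuild-arm).
binfmt_interpreter() {
    awk '$1 == "interpreter" { print $2 }' "$1"
}

# binfmt_flags FILE: print the flags of the entry (e.g. OCF).
binfmt_flags() {
    awk '$1 == "flags:" { print $2 }' "$1"
}
```

After registering the override above, binfmt_interpreter /proc/sys/fs/binfmt_misc/sbuild-arm should print the path of the local 32-bit copy, and binfmt_flags should include F if the binary was opened at registration time.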

@XECDesign Hey mate, could we add #271 (comment) workaround to README? I've spent whole day trying to solve SSL issue and this comment solved it like a magic pill, it could help a lot of people!

Any update on this with Debian 11 Bullseye?

I can build Bullseye without issue on Bullseye AMD Machine

32 or 64 bit?

I can build fine on AMD machine too (pop_os 21.04 and 21.10) both armhf and arm64 for buster or bullseye.
I build server images thus

Depending on the version of qemu and whether you're using docker, server images might not exhibit issues, but I still wouldn't recommend it until qemu is fixed.

There has been some good progress upstream. One of the issues has been fixed and another has a fix in the pipeline (https://gitlab.com/qemu-project/qemu/-/issues/633). Not sure how long it will be before there's an official release with both fixes.

EDIT: I should mention that the fixes only make it work with i686 qemu. amd64 still won't work, but that seems to be a glibc and/or kernel issue that might not be fixable. I'm not sure what the current state of that is.

The qemu inside the Docker image seems to be irrelevant, you need qemu installed on the host (running a Docker build, I got errors about wrong architecture until I installed qemu-user-static on the host). And then I hit this bitness error (invalid SSL certificates etc).

I tried installing qemu-user-static:i386 on the host, but this makes GPG fail in the chroot, so apt can't verify signatures and it fails even earlier. Is there any valid workaround nowadays?

I need some clarification - it was mentioned in this thread that any binary using the readdir syscall is not going to work. To that I say "of course", as this syscall is not implemented in arm64 (or in any architecture that I know of other than x86). I have to be missing something obvious. Why would a binary compiled for arm64 even try to use the readdir syscall? Could someone explain?

Was it meant that any binary using the readdir() library function will not work? I could buy that.

I need some clarification - it was mentioned in this thread that any binary using the readdir syscall is not going to work.

It has been a while since I've looked at this so I can't give full details.

It's likely that you're thinking of this, while the issue is with this.

If the actual syscall is involved, a particular arm binary doesn't have to call it itself. It could be something qemu does, depending on the architecture it's built for and the paths it takes. Not sure.

Either way, the issue seems to be resolved in Bullseye, at least.

Just an update for someone getting here:

I confirm that I just built an arm64 bullseye lite (stage2) image using build.sh on the arm64 branch of this repo. My machine is an amd64 debian bullseye. After dumping the image on a real RPi4 and running curl -sSL https://github.com I got proper html code (so I guess it is indeed fixed)