coreos/coreos-installer

Unable to provision Bare Metal embedding an Ignition config via coreos-installer 0.15.0

freebsdizzle opened this issue · 1 comment

Bug

Most likely user error; looking for guidance, please. I have tried both the coreos-installer binary and the container (release tag).

Host Operating System Version

I have tried the same process on several host distributions, with the same results.

root@infra-2-bm:~# uname -a
Linux infra-2-bm 5.4.0-1040-ibm #45-Ubuntu SMP Mon Nov 28 13:10:34 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
root@infra-2-bm:~# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"

[root@infra-1-bm ~]# uname -r
4.18.0-372.40.1.el8_6.x86_64
[root@infra-1-bm ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.6 (Ootpa)

Target Operating System Version

Red Hat CoreOS 4.11.22

coreos-installer Version

0.15.0

Expected Behavior

1A. Bare Metal nodes provision successfully from an Ignition file embedded via the coreos-installer tool.

Actual Behavior

On first boot, we're able to SSH into Fedora CoreOS, given that we're passing the same SSH public key that was used in install-config.yaml.

After the initial reboot into the RH CoreOS kernel, we lose SSH access via the core user and the Bare Metal server never provisions into a worker node. oc get csr never shows the expected certificate signing request.

We have followed a similar bootstrap process, using the same worker.ign, for virtual machines, given that we're targeting a UPI, non-integrated deployment. We have no issues bootstrapping a fully functional 4.11.22 cluster that way; we just cannot provision Bare Metal via the coreos-installer tool.

Reproduction Steps

1A. Run a simple script to create the image and dd it to disk (not the full script, just the gist):

#!/bin/bash
#
# Expected SHA-256 checksum of the coreos-installer binary
SUMCI='46a5424069a1f25126f12568f30731ff2f79b9b5f51e29dc5976d7d9942b67d4'
dnf install -y qemu-img vim podman
curl -O https://mirror.openshift.com/pub/openshift-v4/clients/coreos-installer/latest/coreos-installer
chmod +x ./coreos-installer
BIN=$(sha256sum coreos-installer | cut -d " " -f1)
if [ "${SUMCI}" != "${BIN}" ]; then echo "checksum mismatch!"; exit 1; fi
# Build the image in a RAM-backed raw file exposed as /dev/nbd0
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=5G tmpfs /mnt/ramdisk
qemu-img create /mnt/ramdisk/coreos.raw 5G
modprobe nbd max_part=8
qemu-nbd --connect=/dev/nbd0 -f raw /mnt/ramdisk/coreos.raw
# Pointer config that merges the worker config served on port 22623
# (the CA certificate payload is truncated in this report)
echo '{"ignition":{"config":{"merge":[{"source":"https://api-int.ocp4.example.com:22623/config/worker"}]},"security":{"tls":{"certificateAuthorities":[{"source":"data:text/plain;charset=utf-8;base64,L}]}},"version":"3.2.0"}}' > worker.ign
podman run --pull=always --rm -i quay.io/coreos/ignition-validate:release - < worker.ign
./coreos-installer install /dev/nbd0 -p metal -i worker.ign
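
The script above compares the two digests with a string test. An alternative sketch, assuming GNU coreutils, is to let sha256sum --check do the comparison. verify_sha256 is a hypothetical helper name, and the demonstration runs on a throwaway file rather than on the real coreos-installer binary.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only when the SHA-256 digest of FILE
# equals EXPECTED. Usage: verify_sha256 FILE EXPECTED
verify_sha256() {
    local file="$1" expected="$2"
    # sha256sum --check reads "DIGEST  FILENAME" lines (two spaces) from stdin;
    # --status suppresses output and reports the result via the exit code.
    printf '%s  %s\n' "$expected" "$file" | sha256sum --check --status -
}

# Demonstration on a temporary file (not the real binary).
demo_file="$(mktemp)"
printf 'hello\n' > "$demo_file"
demo_sum="$(sha256sum "$demo_file" | cut -d ' ' -f1)"

if verify_sha256 "$demo_file" "$demo_sum"; then
    echo "checksum ok"
else
    echo "checksum mismatch"
fi
rm -f "$demo_file"
```

In the script this would read verify_sha256 coreos-installer "${SUMCI}" || exit 1. Note that this checks a checksum, not a cryptographic signature.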

1B. dd image to disk and reboot - link to screen recording

dd if=/mnt/ramdisk/coreos.raw of=/dev/sda bs=1M

user-error-install.txt

Other Information

fdisk -l /mnt/ramdisk/coreos.raw
GPT PMBR size mismatch (4859903 != 10485759) will be corrected by write.
The backup GPT table is not on the end of the device. This problem will be corrected by write.
Disk /mnt/ramdisk/coreos.raw: 5 GiB, 5368709120 bytes, 10485760 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 00000000-0000-4000-A000-000000000001

Device                     Start     End Sectors  Size Type
/mnt/ramdisk/coreos.raw1    2048    4095    2048    1M BIOS boot
/mnt/ramdisk/coreos.raw2    4096  264191  260096  127M EFI System
/mnt/ramdisk/coreos.raw3  264192 1050623  786432  384M Linux filesystem
/mnt/ramdisk/coreos.raw4 1050624 4859870 3809247  1.8G Linux filesystem

Disk /dev/sda: 894.2 GiB, 960129990656 bytes, 1875253888 sectors
Disk model: SMC VD          
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 00000000-0000-4000-A000-000000000001

Device       Start     End Sectors  Size Type
/dev/sda1     2048    4095    2048    1M BIOS boot
/dev/sda2     4096  264191  260096  127M EFI System
/dev/sda3   264192 1050623  786432  384M Linux filesystem
/dev/sda4  1050624 4859870 3809247  1.8G Linux filesystem
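
The "GPT PMBR size mismatch" warning above can be checked against the sector counts that fdisk prints. This is a rough sketch, assuming the usual GPT layout on 512-byte sectors: the protective MBR covers every sector except sector 0, and the backup GPT occupies the last 33 sectors. sectors_to_mib is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: convert a count of 512-byte sectors to whole MiB.
sectors_to_mib() {
    echo $(( $1 * 512 / 1024 / 1024 ))
}

pmbr_size=4859903        # first number in the "GPT PMBR size mismatch" warning
raw_sectors=10485760     # size of /mnt/ramdisk/coreos.raw reported by fdisk
last_part_end=4859870    # end sector of partition 4

# The protective MBR spans sectors 1..N-1, so the GPT was written for N sectors.
image_sectors=$(( pmbr_size + 1 ))

echo "GPT written for: $(sectors_to_mib "$image_sectors") MiB"    # 2373 MiB
echo "Raw file size:   $(sectors_to_mib "$raw_sectors") MiB"      # 5120 MiB
# Sectors between the end of the last partition and the end of the original image
echo "Backup GPT area: $(( image_sectors - 1 - last_part_end )) sectors"   # 33
```

Under those assumptions the partition table describes an image of about 2.3 GiB, so the warning only reflects that the 5 GiB raw file is larger than the image written into it.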

As a workaround, we wanted to test using an RH CoreOS image downloaded locally from mirror.openshift.com.

Can an alternative local image file be used this way?

Expected Behavior

2A. Be able to use RH CoreOS images obtained from the OpenShift mirror as local images to provision nodes, instead of the Fedora CoreOS streams.

Actual Behavior

Copying image from rhcos-4.11.9-x86_64-metal.x86_64.raw
Reading signature from rhcos-4.11.9-x86_64-metal.x86_64.raw.sig

Error: sniffing input

Caused by:
    Broken pipe (os error 32)

Resetting partition table
Error: install failed

Reproduction Steps

see file attached

user-error-mirror.openshift.txt

This report has several features that aren't involved in, or are outside the scope of, a typical coreos-installer run. I'm seeing mentions of installing to a loopback network block device and then dding onto the target disk (rather than installing directly onto the target disk), rebooting from Fedora CoreOS into RHEL CoreOS, losing SSH access to already-provisioned machines, and CSRs. I haven't seen the reported Broken pipe error before, but one possibility is that gpg is exiting early for an unknown reason; it's being run in the background to verify the GPG signature.

I'd recommend following the OpenShift bare-metal documentation here, as much as you can. If you have requirements that aren't met by the standard install flows, then other flows are possible, but the degree of customization here makes it difficult to tell what's going on. For general installation help, please contact OpenShift support. If you can identify a specific reproducible coreos-installer behavior that you think is incorrect, feel free to open a new bug here.
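
To narrow down the two points raised above (gpg possibly exiting early, and installing directly onto the target disk), something like the following sketch may help. It is not a verified fix. The file names and the /dev/sda target are taken from the report, the run helper is hypothetical, and the gpg step assumes a detached signature sits next to the image, as the log suggests. Commands are only printed unless RUN=1 is set.

```shell
#!/bin/bash
# Values taken from the report; adjust for the actual system.
IMAGE="rhcos-4.11.9-x86_64-metal.x86_64.raw"
TARGET="/dev/sda"
IGNITION="worker.ign"

# Hypothetical helper: print each command by default; set RUN=1 to execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# 1. Run gpg by hand to see whether signature verification itself fails.
run gpg --verify "${IMAGE}.sig" "${IMAGE}"

# 2. Install the local image straight onto the target disk from a live
#    environment, without the intermediate nbd device and dd step.
run coreos-installer install "${TARGET}" \
    --image-file "${IMAGE}" \
    --ignition-file "${IGNITION}" \
    --platform metal
```

If gpg fails when run by hand, its error message is likely to be more informative than the installer's "Broken pipe".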