Unable to provision Bare Metal embedding an Ignition config via coreos-installer 0.15.0
freebsdizzle opened this issue · 1 comments
Bug
Most likely user error - looking for guidance, please. Have tried using the binary and container:release
Host Operating System Version
Have tried same process on various flavors with the same results.
root@infra-2-bm:~# uname -a
Linux infra-2-bm 5.4.0-1040-ibm #45-Ubuntu SMP Mon Nov 28 13:10:34 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
root@infra-2-bm:~# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"
[root@infra-1-bm ~]# uname -r
4.18.0-372.40.1.el8_6.x86_64
[root@infra-1-bm ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.6 (Ootpa)
Target Operating System Version
Red Hat CoreOS 4.11.22
coreos-installer Version
0.15.0
Expected Behavior
1A. Bare Metal nodes provision successfully using the Ignition file embedded by the coreos-installer tool.
Actual Behavior
On first boot, we're able to SSH in to Fedora CoreOS, since we pass the same SSH public key that was used in install-config.yaml.
After the initial reboot to the RH CoreOS kernel, we lose SSH access via the core user and the Bare Metal server never provisions into a worker node. oc get csr never produces a certificate signing request as expected.
We have followed a similar bootstrap process using the same worker.ign for virtual machines, given that we're targeting a UPI non-integrated deployment. There we have no issues bootstrapping a fully functional 4.11.22 cluster; we just cannot provision Bare Metal via the coreos-installer tool.
Reproduction Steps
1A. Run simple script to create image and dd to disk - not full script but just to show the gist
#!/bin/bash
#
SUMCI='46a5424069a1f25126f12568f30731ff2f79b9b5f51e29dc5976d7d9942b67d4'
dnf install qemu-img vim podman -y
curl -O https://mirror.openshift.com/pub/openshift-v4/clients/coreos-installer/latest/coreos-installer
chmod +x ./coreos-installer
BIN=$(sha256sum coreos-installer | cut -d " " -f1)
if [ "${SUMCI}" != "${BIN}" ]; then echo "signature mismatch!"; exit 1; fi
mkdir /mnt/ramdisk
mount -t tmpfs -o size=5G tmpfs /mnt/ramdisk
qemu-img create /mnt/ramdisk/coreos.raw 5G
modprobe nbd max_part=8
qemu-nbd --connect=/dev/nbd0 -f raw /mnt/ramdisk/coreos.raw
echo '{"ignition":{"config":{"merge":[{"source":"https://api-int.ocp4.example.com:22623/config/worker"}]},"security":{"tls":{"certificateAuthorities":[{"source":"data:text/plain;charset=utf-8;base64,L}]}},"version":"3.2.0"}}' > worker.ign
podman run --pull=always --rm -i quay.io/coreos/ignition-validate:release - < worker.ign
./coreos-installer install /dev/nbd0 -p metal -i worker.ign
1B. dd image to disk and reboot - link to screen recording
dd if=/mnt/ramdisk/coreos.raw of=/dev/sda bs=1M
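The checksum comparison near the top of the script can also be written as a small function. This is only a sketch of the same check; the function name verify_sha256 is hypothetical and not part of the original script.

```shell
#!/bin/bash
# verify_sha256 FILE EXPECTED_HEX  (hypothetical helper name)
# Returns 0 when FILE's SHA-256 digest equals EXPECTED_HEX, 1 otherwise.
verify_sha256() {
    local file=$1
    local expected=$2
    local actual
    # sha256sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(sha256sum "$file" | cut -d ' ' -f1)
    if [ "$actual" != "$expected" ]; then
        echo "checksum mismatch for $file: got '$actual'" >&2
        return 1
    fi
    return 0
}
```

In the script above it would be called as verify_sha256 coreos-installer "${SUMCI}" || exit 1. Quoting both values means a missing or empty digest is reported as a mismatch instead of producing a test syntax error.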
Other Information
fdisk -l /mnt/ramdisk/coreos.raw
GPT PMBR size mismatch (4859903 != 10485759) will be corrected by write.
The backup GPT table is not on the end of the device. This problem will be corrected by write.
Disk /mnt/ramdisk/coreos.raw: 5 GiB, 5368709120 bytes, 10485760 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 00000000-0000-4000-A000-000000000001
Device Start End Sectors Size Type
/mnt/ramdisk/coreos.raw1 2048 4095 2048 1M BIOS boot
/mnt/ramdisk/coreos.raw2 4096 264191 260096 127M EFI System
/mnt/ramdisk/coreos.raw3 264192 1050623 786432 384M Linux filesystem
/mnt/ramdisk/coreos.raw4 1050624 4859870 3809247 1.8G Linux filesystem
Disk /dev/sda: 894.2 GiB, 960129990656 bytes, 1875253888 sectors
Disk model: SMC VD
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 00000000-0000-4000-A000-000000000001
Device Start End Sectors Size Type
/dev/sda1 2048 4095 2048 1M BIOS boot
/dev/sda2 4096 264191 260096 127M EFI System
/dev/sda3 264192 1050623 786432 384M Linux filesystem
/dev/sda4 1050624 4859870 3809247 1.8G Linux filesystem
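For reading the fdisk warnings above, the numbers can be checked with a little arithmetic. This is a sketch that assumes the standard GPT layout (a backup area of 32 partition-entry sectors plus one header sector) and takes every value from the output above; the variable names are illustrative.

```shell
#!/bin/bash
# Values copied from the fdisk output above.
SECTOR_BYTES=512
PMBR_LAST_LBA=4859903       # first number in "GPT PMBR size mismatch"
LAST_PART_END=4859870       # end sector of partition 4
CONTAINER_SECTORS=10485760  # size of the 5 GiB coreos.raw

# The protective MBR records the last LBA, so the image is one sector longer.
image_sectors=$((PMBR_LAST_LBA + 1))
image_mib=$((image_sectors * SECTOR_BYTES / 1024 / 1024))
container_mib=$((CONTAINER_SECTORS * SECTOR_BYTES / 1024 / 1024))
# Sectors between the end of partition 4 and the end of the written image.
gpt_tail=$((PMBR_LAST_LBA - LAST_PART_END))

echo "image as written: ${image_sectors} sectors (${image_mib} MiB)"
echo "container:        ${CONTAINER_SECTORS} sectors (${container_mib} MiB)"
echo "sectors after partition 4: ${gpt_tail}"
```

Under those assumptions the written image is 2373 MiB inside a 5120 MiB container, and the 33 sectors after partition 4 match the size of a backup GPT. That is consistent with the backup table sitting at the end of the image rather than at the end of the larger device, which is what the two warnings describe.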
As a workaround, we wanted to test using a RH CoreOS image obtained locally from mirror.openshift.com.
Can an alternative local image file be used as such?
Expected Behavior
2A. Be able to use RH CoreOS images obtained from the OpenShift mirror as local images to provision nodes instead of the Fedora CoreOS streams.
Actual Behavior
Copying image from rhcos-4.11.9-x86_64-metal.x86_64.raw
Reading signature from rhcos-4.11.9-x86_64-metal.x86_64.raw.sig
Error: sniffing input
Caused by:
Broken pipe (os error 32)
Resetting partition table
Error: install failed
Reproduction Steps
see file attached
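Since the attached file is not reproduced here, the following is only a sketch of what a local-image install command could look like, not the command that was actually run. It assumes the image and worker.ign are in the current directory and that the target is /dev/sda; build_install_cmd is a hypothetical helper that prints the command for review instead of executing it.

```shell
#!/bin/bash
# build_install_cmd DEVICE IMAGE IGNITION  (hypothetical helper name)
# Prints a coreos-installer command line that installs a local image file
# to DEVICE with an embedded Ignition config. Nothing is executed.
build_install_cmd() {
    echo "coreos-installer install $1 --image-file $2 --ignition-file $3 --platform metal"
}

# Example values taken from this report; adjust the device before use.
build_install_cmd /dev/sda rhcos-4.11.9-x86_64-metal.x86_64.raw worker.ign
```

The --image-file, --ignition-file and --platform options are the long forms of the -f, -i and -p flags of coreos-installer install.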
This report has several features that aren't involved in, or are outside the scope of, a typical coreos-installer run. I'm seeing mentions of installing to a loopback network block device and then dding onto the target disk (rather than installing directly onto the target disk), rebooting from Fedora CoreOS into RHEL CoreOS, losing SSH access to already-provisioned machines, and CSRs. I haven't seen the reported Broken pipe error before, but one possibility is that gpg is exiting early for an unknown reason; it's being run in the background to verify the GPG signature.
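One way to see whether gpg fails by itself on these files is to run the verification by hand. This is a sketch, not what coreos-installer does internally: check_detached_sig is a hypothetical helper, and it assumes the public key that signed the image has already been imported into the local keyring (without it, gpg reports a missing public key, which is a different failure).

```shell
#!/bin/bash
# check_detached_sig IMAGE  (hypothetical helper name)
# Verifies IMAGE against the detached signature IMAGE.sig with gpg.
# Returns 2 if either file is missing, otherwise gpg's own exit status.
check_detached_sig() {
    local image=$1
    local sig="${image}.sig"
    if [ ! -f "$image" ] || [ ! -f "$sig" ]; then
        echo "missing $image or $sig" >&2
        return 2
    fi
    # With two arguments, gpg --verify treats the first as the detached
    # signature and the second as the signed data.
    gpg --verify "$sig" "$image"
}
```

For the files named in the log above, the call would be check_detached_sig rhcos-4.11.9-x86_64-metal.x86_64.raw, followed by echo $? to see the exit status.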
I'd recommend following the OpenShift bare-metal documentation here, as much as you can. If you have requirements that aren't met by the standard install flows, then other flows are possible, but the degree of customization here makes it difficult to tell what's going on. For general installation help, please contact OpenShift support. If you can identify a specific reproducible coreos-installer behavior that you think is incorrect, feel free to open a new bug here.