flatcar/Flatcar

flatcar-install fails when using path disk devices with multiple copies

Description

flatcar-install fails when the target disk is specified through a /dev/disk/by-path/ alias and the same disk has more than one such alias, for example:

# ls -l /dev/disk/by-path/
total 0
lrwxrwxrwx. 1 root root  9 Jul 18 14:01 pci-0000:44:00.0-ata-1 -> ../../sda
lrwxrwxrwx. 1 root root 10 Jul 18 14:01 pci-0000:44:00.0-ata-1-part1 -> ../../sda1
lrwxrwxrwx. 1 root root 10 Jul 18 14:01 pci-0000:44:00.0-ata-1-part2 -> ../../sda2
lrwxrwxrwx. 1 root root 10 Jul 18 14:01 pci-0000:44:00.0-ata-1-part3 -> ../../sda3
lrwxrwxrwx. 1 root root 10 Jul 18 14:01 pci-0000:44:00.0-ata-1-part4 -> ../../sda4
lrwxrwxrwx. 1 root root  9 Jul 18 14:01 pci-0000:44:00.0-ata-1.0 -> ../../sda
lrwxrwxrwx. 1 root root 10 Jul 18 14:01 pci-0000:44:00.0-ata-1.0-part1 -> ../../sda1
lrwxrwxrwx. 1 root root 10 Jul 18 14:01 pci-0000:44:00.0-ata-1.0-part2 -> ../../sda2
lrwxrwxrwx. 1 root root 10 Jul 18 14:01 pci-0000:44:00.0-ata-1.0-part3 -> ../../sda3
lrwxrwxrwx. 1 root root 10 Jul 18 14:01 pci-0000:44:00.0-ata-1.0-part4 -> ../../sda4
lrwxrwxrwx. 1 root root  9 Jul 18 14:01 pci-0000:44:00.0-ata-2 -> ../../sdb
lrwxrwxrwx. 1 root root  9 Jul 18 14:01 pci-0000:44:00.0-ata-2.0 -> ../../sdb
[...]

Impact

flatcar-install -d /dev/disk/by-path/pci-0000:44:00.0-ata-1 -i provider.ign

The install script fails to mount the OEM partition to write the Ignition file.

Expected behavior

Install script completes successfully.

Additional information

The following command does work:

flatcar-install -d /dev/disk/by-path/pci-0000:44:00.0-ata-1.0 -i provider.ign

It seems the root cause is the way the OEM partition is located using blkid to write the Ignition file.

local OEM_DEV=$(blkid -t "LABEL=OEM" -o device "${DEVICE}"*)

In our case this command returns multiple values, but the result is handled as a single path, so the subsequent mount fails.
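
One way to collapse the duplicate matches is to resolve every alias to its canonical path and de-duplicate the result. The following is only a minimal sketch, not the change made in flatcar-install; the helper name canonical_unique is hypothetical, and the sketch assumes a readlink that supports -f (as GNU coreutils does).

```shell
#!/bin/bash
# Hypothetical helper (not part of flatcar-install): read device paths from
# stdin, one per line, resolve each symlink alias to its canonical path, and
# print every distinct result once.
canonical_unique() {
    local path
    while IFS= read -r path; do
        # Skip blank lines, e.g. when blkid prints nothing.
        [ -n "${path}" ] || continue
        readlink -f "${path}"
    done | sort -u
}

# Assumed usage, mirroring the line quoted above:
#   OEM_DEV=$(blkid -t "LABEL=OEM" -o device "${DEVICE}"* | canonical_unique)
```

Two by-path aliases of the same partition resolve to the same device node, so a single value would remain. A glob that also matches a different disk (for example ata-1* matching ata-10) would still produce several lines, which this sketch does not handle.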

Hello @jqueuniet,

The Flatcar install script can be found here: https://github.com/flatcar/init/blob/flatcar-master/bin/flatcar-install

According to the man page of blkid, adding the flag -l should solve the issue https://linux.die.net/man/8/blkid.

Can you please run this command on your server to confirm?

blkid -l -t "LABEL=OEM" -o device $DEVICE

Thank you.

It looks like it does solve my issue. The only side effect I can see is that it returns the canonical device path (/dev/sdXY) instead of keeping the alias that was passed in. Here is the output from a Flatcar PXE ramdisk:

root@localhost ~ # blkid -t "LABEL=OEM" -o device /dev/disk/by-path/pci-0000\:44\:00.0-ata-1*
/dev/disk/by-path/pci-0000:44:00.0-ata-1-part6
/dev/disk/by-path/pci-0000:44:00.0-ata-1.0-part6
root@localhost ~ # blkid -l -t "LABEL=OEM" -o device /dev/disk/by-path/pci-0000\:44\:00.0-ata-1
/dev/sdb6
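
If the concern is whether the canonical path and the alias name the same partition, the -ef operator of bash's test builtin can confirm it without comparing strings. This is a minimal sketch; same_device is a hypothetical helper name, not something flatcar-install provides.

```shell
#!/bin/bash
# Hypothetical helper: succeed when both paths refer to the same file
# (same device and inode numbers) after following symlinks.
same_device() {
    [ "$1" -ef "$2" ]
}

# Assumed usage with the paths from the output above:
#   same_device /dev/sdb6 /dev/disk/by-path/pci-0000:44:00.0-ata-1-part6 \
#       && echo "same partition"
```

Because -ef follows symlinks, any by-path alias compares equal to the device node it points at.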

By the way, here are the SATA controller details, in case anyone needs them:

44:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51) (prog-if 01 [AHCI 1.0])
	Subsystem: Super Micro Computer Inc H12SSL-i [15d9:7901]
	Flags: bus master, fast devsel, latency 0, IRQ 91, IOMMU group 43
	Memory at b0600000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [64] Express Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=16/16 Maskable- 64bit+
	Capabilities: [d0] SATA HBA v1.0
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [270] Secondary PCI Express
	Capabilities: [2a0] Access Control Services
	Capabilities: [400] Data Link Feature <?>
	Capabilities: [410] Physical Layer 16.0 GT/s <?>
	Capabilities: [440] Lane Margining at the Receiver <?>
	Kernel driver in use: ahci
	Kernel modules: ahci

Great to hear that it solves your issue. I will draft a PR to gather comments and see how I can improve it. In the meantime, can you use the fork code, or do you need the patch in main as soon as possible?

Thanks.

I'm not in a hurry. Using the device with the longest name, so that only a single match is returned, is a viable workaround for me until this fix lands in the next release.

I mostly wanted to report this because the error is rather cryptic: with no filesystem hint, the default mount behavior ends up detecting this broken path as an NFS share. The script then spends a long time trying to mount the OEM partition as such before hitting a timeout and returning a misleading error.
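
A guard that rejects anything other than a single path before calling mount would turn this into an immediate, readable error. The sketch below is an illustration only; require_single_path is a hypothetical name. The NFS guess is possibly triggered by the colons in the by-path names, but that explanation is an assumption, not something verified here.

```shell
#!/bin/bash
# Hypothetical guard: accept a value only if it is one non-empty line.
# Prints the value on success; prints an error to stderr and returns 1
# otherwise.
require_single_path() {
    local value="$1"
    if [ -z "${value}" ]; then
        echo "error: no matching device found" >&2
        return 1
    fi
    case "${value}" in
        *$'\n'*)
            # An embedded newline means several matches were captured.
            echo "error: expected one device, got several:" >&2
            printf '%s\n' "${value}" >&2
            return 1
            ;;
    esac
    printf '%s\n' "${value}"
}

# Assumed usage before mounting:
#   OEM_DEV=$(require_single_path "${OEM_DEV}") || exit 1
```

Failing early keeps the broken multi-line value from ever reaching mount.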

Anyway, thanks a lot for the quick solution.