ClangBuiltLinux/linux

51d334a845a082338735b0fdfc620a4b15fa26fe breaks x86_64 ThinLTO + dynamic ftrace

samitolvanen opened this issue · 17 comments

Starting with LLVM commit 51d334a845a082338735b0fdfc620a4b15fa26fe ("[NFCI] Lazily evaluate SCEVs of PHIs"), ToT x86_64 defconfig + ThinLTO + dynamic ftrace fails to boot:

[    0.639299] =====================================
[    0.639718] WARNING: bad unlock balance detected!
[    0.640137] 5.13.0-rc3+ #1 Not tainted
[    0.640143] -------------------------------------
[    0.640143] swapper/0/1 is trying to release lock (
[    0.640143] general protection fault, probably for non-canonical address 0x25453be8082474ff: 0000 [#1] SMP PTI
[    0.640143] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.13.0-rc3+ #1
[    0.640143] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[    0.640143] RIP: 0010:string+0x50/0x110
[    0.640143] Code: ce 48 c1 fe 30 85 f6 0f 84 be 00 00 00 31 c0 eb 15 66 0f 1f 84 00 00 00 00 00 48 83 c0 01 39 c6 0f 84 ac 00 00 00 4c 81
[    0.640143] RSP: 0000:ffff927ec0013af0 EFLAGS: 00010046
[    0.640143] RAX: 0000000000000000 RBX: ffff927ec0013bea RCX: ffff0a00ffffff04
[    0.640143] RDX: ffff927ec0013bf0 RSI: ffffffffffffffff RDI: ffff927ec0013bea
[    0.640143] RBP: ffffffff8a2f044d R08: 25453be8082474ff R09: ffff927ec0013bea
[    0.640143] R10: 0000001000000000 R11: 0000000400000000 R12: ffff927ec0013bf0
[    0.640143] R13: ffffffff8a2f044f R14: ffff0a00ffffff04 R15: 0000000000000002
[    0.640143] FS:  0000000000000000(0000) GS:ffff90ff5f400000(0000) knlGS:0000000000000000
[    0.640143] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.640143] CR2: ffff90ff56001000 CR3: 0000000014a22000 CR4: 00000000000006f0
[    0.640143] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.640143] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.640143] Call Trace:
[    0.640143]  vsnprintf+0x615/0x750
[    0.640143]  vprintk_store+0xb8/0x4f0
[    0.640143]  ? vprintk_emit+0x103/0x180
[    0.640143]  ? vprintk_emit+0x103/0x180
[    0.640143]  vprintk_emit+0x89/0x180
[    0.640143]  ? _raw_read_lock+0x14/0x50
[    0.640143]  vprintk+0x12e/0x150
[    0.640143]  ? _raw_read_lock+0x14/0x50
[    0.640143]  printk+0x59/0x78
[    0.640143]  print_unlock_imbalance_bug+0x122/0x190
[    0.640143]  ? lockdep_hardirqs_on_prepare+0x187/0x280
[    0.640143]  ? _raw_read_lock+0x14/0x50
[    0.640143]  ? _raw_read_lock+0x14/0x50
[    0.640143]  lock_release+0x18f/0x2a0
[    0.640143]  ? _raw_spin_unlock_irq+0x24/0x30
[    0.640143]  percpu_up_write+0x1a/0x50
[    0.640143]  ? _cpu_up+0x2e/0x260
[    0.640143]  ? cpu_up.llvm.6178664248410513755+0x6a/0xa0
[    0.640143]  ? kernel_init+0xa/0x1b0
[    0.640143]  ? bringup_nonboot_cpus+0x6a/0x70
[    0.640143]  ? smp_init+0x28/0x7f
[    0.640143]  ? kernel_init_freeable+0x147/0x1b5
[    0.640143]  ? rest_init+0x1e0/0x1e0
[    0.640143]  ? kernel_init+0xa/0x1b0
[    0.640143]  ? ret_from_fork+0x22/0x30
[    0.640143] Modules linked in:
[    0.640143] Dumping ftrace buffer:
[    0.640143]    (ftrace buffer empty)
[    0.640143] ---[ end trace a4719c0acf67613d ]---
[    0.640143] RIP: 0010:string+0x50/0x110
[    0.640143] Code: ce 48 c1 fe 30 85 f6 0f 84 be 00 00 00 31 c0 eb 15 66 0f 1f 84 00 00 00 00 00 48 83 c0 01 39 c6 0f 84 ac 00 00 00 4c 81
[    0.640143] RSP: 0000:ffff927ec0013af0 EFLAGS: 00010046
[    0.640143] RAX: 0000000000000000 RBX: ffff927ec0013bea RCX: ffff0a00ffffff04
[    0.640143] RDX: ffff927ec0013bf0 RSI: ffffffffffffffff RDI: ffff927ec0013bea
[    0.640143] RBP: ffffffff8a2f044d R08: 25453be8082474ff R09: ffff927ec0013bea
[    0.640143] R10: 0000001000000000 R11: 0000000400000000 R12: ffff927ec0013bf0
[    0.640143] R13: ffffffff8a2f044f R14: ffff0a00ffffff04 R15: 0000000000000002
[    0.640143] FS:  0000000000000000(0000) GS:ffff90ff5f400000(0000) knlGS:0000000000000000
[    0.640143] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.640143] CR2: ffff90ff56001000 CR3: 0000000014a22000 CR4: 00000000000006f0
[    0.640143] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.640143] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.640143] Kernel panic - not syncing: Fatal exception
[    0.640143] Dumping ftrace buffer:  
[    0.640143]    (ftrace buffer empty)

The kernel was configured as follows:

$ make ARCH=x86_64 LLVM=1 LLVM_IAS=1 defconfig
# ThinLTO + dynamic ftrace + CBL CI configs
$ ./scripts/config -e LTO_CLANG_THIN -d LTO_CLANG_FULL -d GCOV_KERNEL -d KASAN -e FTRACE_STARTUP_TEST -e FUNCTION_TRACER -e DYNAMIC_FTRACE -e FTRACE -e TRACING -e LOCK_TORTURE_TEST -e RCU_TORTURE_TEST -e EXPERT -e DEBUG_KERNEL -e LOCK_DEBUGGING_SUPPORT -e DEBUG_RT_MUTEXES -e DEBUG_SPINLOCK -e DEBUG_MUTEXES -e DEBUG_WW_MUTEX_SLOWPATH -e DEBUG_LOCK_ALLOC -e DEBUG_LOCKDEP -e DEBUG_ATOMIC_SLEEP -e CRYPTO_MANAGER_DISABLE_TESTS -e CRC32_SELFTEST -e DEBUG_LOCKING_API_SELFTESTS -e DEBUG_NMI_SELFTEST -e DEBUG_RODATA_TEST -e DEBUG_TLBFLUSH -e DMATEST -e GLOB_SELFTEST -e PCI_EPF_TEST -e PCI_ENDPOINT_TEST -e PROVE_LOCKING -e PROVE_RCU -e RBTREE_TEST -e RCU_EQS_DEBUG -e STATIC_KEYS_SELFTEST -e STRING_SELFTEST -e TEST_BITMAP -e TEST_FIRMWARE -e TEST_SORT -e TEST_SYSCTL -e TEST_UUID -e TORTURE_TEST -e USB_TEST -e USB_EHSET_TEST_FIXTURE -e USB_LINK_LAYER_TEST -e WW_MUTEX_SELFTEST -e BLK_DEV_INITRD -e DEVTMPFS -e DEVTMPFS_MOUNT -e SCSI -e BLK_DEV_SD -e SCSI_LOWLEVEL -e SCSI_DC395x -e SCSI_AM53C974 -e SCSI_SYM53C8XX_2 -e MEGARAID_SAS -e FUSION -e FUSION_SAS -e MMC -e MMC_SDHCI -e MMC_SDHCI_PCI -e BLK_DEV_NVME -e USB -e USB_XHCI_HCD -e USB_EHCI_HCD -e USB_OHCI_HCD -e USB_STORAGE -e USB_UAS -e VIRTIO -e VIRTIO_PCI -e VIRTIO_PCI_LEGACY -e VIRTIO_BALLOON -e VIRTIO_MMIO -e BLK_MQ_VIRTIO -e VIRTIO_BLK -e VIRTIO_BLK_SCSI -e SCSI_VIRTIO
$ make ARCH=x86_64 LLVM=1 LLVM_IAS=1 olddefconfig

During the build I observed the following objtool warnings, which were not present with earlier compiler versions:

  LTO     vmlinux.o
  OBJTOOL vmlinux.o
vmlinux.o: warning: objtool: .text.percpu_down_write: unexpected end of section
vmlinux.o: warning: objtool: .text.try_check_zero: unexpected end of section

Tested with kernel commit 7ac3a1c.

cc @nickdesaulniers @nathanchance

Does your build of LLVM have assertions enabled? It would be interesting to see if there were any new triggers. The diff of those sections before and after that commit might be interesting as well. Does not look like that commit got any precommit review so it should probably be reverted while this is investigated.

Looks like that patch is a precursor to this patch: https://reviews.llvm.org/D102615#2784051

Does your build of LLVM have assertions enabled? It would be interesting to see if there were any new triggers.

After rebuilding LLVM with -DCMAKE_BUILD_TYPE=Debug and -DLLVM_ENABLE_ASSERTIONS=1, I don't see any new warnings or errors when building the kernel.

The diff of those sections before and after that commit might be interesting as well.

percpu_down_write.59d938e649e62db0cef4903d495e838fbc6a6eb8.txt
percpu_down_write.51d334a845a082338735b0fdfc620a4b15fa26fe.txt

try_check_zero.59d938e649e62db0cef4903d495e838fbc6a6eb8.txt
try_check_zero.51d334a845a082338735b0fdfc620a4b15fa26fe.txt

No idea how it happens but we seem to have lost the last call to _raw_spin_lock_irqsave, which I assume is why everything explodes...

Before:

 100:	5b                   	pop    %rbx
 101:	41 5e                	pop    %r14
 103:	41 5f                	pop    %r15
 105:	5d                   	pop    %rbp
 106:	e9 00 00 00 00       	jmpq   10b <percpu_down_write+0x10b>
			107: R_X86_64_PLT32	.text.finish_rcuwait+0x5c
 10b:	e8 00 00 00 00       	callq  110 <_raw_spin_lock_irqsave>
			10c: R_X86_64_PLT32	schedule-0x4

After:

 117:	5b                   	pop    %rbx
 118:	41 5c                	pop    %r12
 11a:	41 5d                	pop    %r13
 11c:	41 5e                	pop    %r14
 11e:	41 5f                	pop    %r15
 120:	5d                   	pop    %rbp
 121:	e9 00 00 00 00       	jmpq   126 <i915_pipestat_enable_mask.__warned.9>
			122: R_X86_64_PLT32	.text.finish_rcuwait+0x5c

Not sure about the try_check_zero section warning though.

It's actually the broken commit 51d334a845a082338735b0fdfc620a4b15fa26fe that has the _raw_spin_lock_irqsave call at the end of that function:

kernel-build.51d334a845a082338735b0fdfc620a4b15fa26fe/vmlinux.o:     file format elf64-x86-64


Disassembly of section .text.percpu_down_write:

0000000000000000 <percpu_down_write>:
...
 106:	e9 00 00 00 00       	jmpq   10b <percpu_down_write+0x10b>
			107: R_X86_64_PLT32	.text.finish_rcuwait+0x5c
 10b:	e8 00 00 00 00       	callq  110 <_raw_spin_lock_irqsave>
			10c: R_X86_64_PLT32	schedule-0x4

59d938e649e62db0cef4903d495e838fbc6a6eb8 is the previous commit, which still produces a working kernel.

Also note that the disassembly above is after objtool. Let me rebuild and see if there's something obvious that trips it up.

Starting with LLVM commit 51d334a845a082338735b0fdfc620a4b15fa26fe ("[NFCI] Lazily evaluate SCEVs of PHIs"), ToT x86_64 defconfig + ThinLTO + dynamic ftrace fails to boot

I guess it wasn't "No Functional Change Intended" then.

Looks like objtool just correctly removed the __fentry__ call in both cases. Otherwise the functions are identical before and after. For example:

 0000000000000000 <percpu_down_write>:
-   0:  e8 00 00 00 00          callq  5 <percpu_down_write+0x5>
-                       1: R_X86_64_PLT32       __fentry__-0x4
+   0:  0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
+                       1: R_X86_64_NONE        __fentry__-0x4
    5:  55                      push   %rbp
    6:  41 57                   push   %r15
    8:  41 56                   push   %r14

No functional change was intended, must be a bug. Patch reverted. Sorry for disturbance!

Please confirm the failure is gone with

commit 6a2af607ad3523ddc3778b0efb7bb1d5d42a1edb (HEAD -> main, origin/main)
Author: Max Kazantsev <mkazantsev@azul.com>
Date:   Fri May 28 11:05:30 2021 +0700

    Revert "[NFCI] Lazily evaluate SCEVs of PHIs"

    This reverts commit 51d334a845a082338735b0fdfc620a4b15fa26fe.

    Reported failures, need to analyze.

Thanks, Max. I can confirm the kernel boots again after the revert.

Can we try to get a reduced test case for @max-azul ?

Looks like https://reviews.llvm.org/D102615 has a new revision. I will try to test that later today to make sure none of the LTO/CFI builds are broken with it.

https://reviews.llvm.org/D102615 passes a regular defconfig, defconfig + CONFIG_LTO_CLANG_THIN=y, and defconfig + CONFIG_LTO_CLANG_THIN=y + CONFIG_CFI_CLANG=y for arm64 and x86_64 at llvm/llvm-project@cfa9ae9 and samitolvanen@29d7af2.

Guessing that's Diff 6 | 348875.

Nathan, did you test that also with dynamic ftrace? Normal defconfig + LTO was fine with the earlier patch too.

@samitolvanen sorry, I missed testing with CONFIG_FUNCTION_TRACER=y, which should test CONFIG_DYNAMIC_FTRACE. I will do that right now.

Looks good to me with CONFIG_FUNCTION_TRACER=y as well.