flatcar/Flatcar

systemd unit `ldconfig.service` fails to start

Opened this issue · 6 comments

Description

systemd unit ldconfig.service fails to start sometimes on the first boot or the subsequent reboots, but it cannot be reproduced reliably.
ldconfig.service runs before the switchroot during the initrd stage and is more cumbersome to properly reproduce it. [UPDATE] ldconfig.service runs after switchroot.

This is an issue that I have seen in the wild for a while, usually after rebooting a Flatcar instance on ARM64. From what I know, this issue does not affect the functionality of the Flatcar instance.

This issue has made a recurrence recently in the Github Actions. From the GitHub Actions:

  L1: "  "
    L2: " Error: _raid.go:245: could not reboot machine: machine __a8565d1f-ee8d-4220-9dbb-92c8f030eefb__ failed basic checks: some systemd units failed:"
    L3: "??? ldconfig.service loaded failed failed Rebuild Dynamic Linker Cache"
    L4: "status: "
    L5: "journal:-- No entries --"
    L6: "harness.go:593: Found systemd unit failed to start (?[0;1;39mldconfig.service?[0m - Rebuild Dynamic Linker Cache.  ) on machine a8565d1f-ee8d-4220-9dbb-92c8f030eefb console_"
    L7: " "

Debugging this issue, I could get a warning message by running manually ldconfig -X, which had exit code 0 and error message: Message: /lib/ld.so.conf is not an ELF file - it has the wrong magic bytes ....
This issue might be due to the wrong path for the /lib/ld.so.conf and it might be related to this commit: flatcar/scripts@ba45a2b.
I do not know yet if the warning messsage and the systemd unit failure are related, but maybe the warning message sometimes is seen as error output and thus fails the unit.

The definition of the systemd unit:

systemctl cat ldconfig.service
# /usr/lib/systemd/system/ldconfig.service
#  SPDX-License-Identifier: LGPL-2.1-or-later
#
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

[Unit]
Description=Rebuild Dynamic Linker Cache
Documentation=man:ldconfig(8)

ConditionNeedsUpdate=|/etc
ConditionFileNotEmpty=|!/etc/ld.so.cache

DefaultDependencies=no
After=local-fs.target
Before=sysinit.target systemd-update-done.service
Conflicts=shutdown.target initrd-switch-root.target
Before=shutdown.target initrd-switch-root.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/ldconfig -X

Impact

The test framework needs to re-run the test, sometimes 3 or 4 times. In real world scenarios, because of the systemd unit failure, automation might break or sanitiy check tools might flag this issue.

Environment and steps to reproduce

Example test run that had to retry some of the Mantle tests:
https://github.com/flatcar/scripts/actions/runs/9777950641?pr=2089

Expected behavior

systemd unit ldconfig.service should not fail.

I had this issue once in a VM and I suspected a memory issue.

I could reliably reproduce the issue by bisecting the RAM size of a qemu-arm64 vm on arm64. 256MB of RAM, the system OOMs and does not boot. 512 MB of RAM, the system runs fine. But on 384 MB of RAM, I got this in the journalctl logs:

Jul 05 09:22:09 localhost systemd-coredump[1517]: Process 1378 (ldconfig) of user 0 dumped core.

                                                  Stack trace of thread 1378:
                                                  #0  0x0000ffff8f15331c n/a (n/a + 0x0)
                                                  #1  0x0000ffff8f153308 n/a (n/a + 0x0)
                                                  #2  0x0000ffff8f145610 n/a (n/a + 0x0)
                                                  #3  0x0000ffff8f137f18 n/a (n/a + 0x0)
                                                  #4  0x0000ffff8f13ac08 n/a (n/a + 0x0)
                                                  #5  0x0000ffff8f13bf9c n/a (n/a + 0x0)
                                                  #6  0x0000ffff8f138b40 n/a (n/a + 0x0)
                                                  #7  0x0000ffff8f13dfc0 n/a (n/a + 0x0)
                                                  #8  0x0000ffff8f13e2e4 n/a (n/a + 0x0)
                                                  #9  0x0000ffff8f1391b0 n/a (n/a + 0x0)
                                                  ELF object binary architecture: AARCH64
Jul 05 09:22:09 localhost systemd[1]: Started systemd-userdbd.service - User Database Manager.
Jul 05 09:22:09 localhost systemd[1]: ldconfig.service: Main process exited, code=dumped, status=6/ABRT
Jul 05 09:22:09 localhost systemd[1]: ldconfig.service: Failed with result 'core-dump'.
Jul 05 09:22:09 localhost systemd[1]: Failed to start ldconfig.service - Rebuild Dynamic Linker Cache.

ldconfig runs after switch-root:

$ journalctl -b0 | grep -Ee 'switch-root|ldconfig'
Jul 05 09:50:26 localhost systemd[1]: Reached target initrd-switch-root.target - Switch Root.
Jul 05 09:50:26 localhost systemd[1]: Starting initrd-switch-root.service - Switch Root...
Jul 05 09:50:27 localhost systemd[1]: initrd-switch-root.service: Deactivated successfully.
Jul 05 09:50:27 localhost systemd[1]: Stopped initrd-switch-root.service - Switch Root.
Jul 05 09:50:27 localhost systemd[1]: Stopped target initrd-switch-root.target - Switch Root.
Jul 05 09:50:27 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=initrd-switch-root comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 05 09:50:27 localhost audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=initrd-switch-root comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 05 09:50:28 localhost systemd[1]: Starting ldconfig.service - Rebuild Dynamic Linker Cache...
Jul 05 09:50:29 localhost ldconfig[1230]: /sbin/ldconfig: /lib/ld.so.conf is not an ELF file - it has the wrong magic bytes at the start.
Jul 05 09:50:29 localhost systemd[1]: Finished ldconfig.service - Rebuild Dynamic Linker Cache.
Jul 05 09:50:29 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=ldconfig comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

The segfault due to low memory may be different than other failure modes. I debugged the coredump for an occurance of this in tests once, and saw that it had to do with parsing /var/cache/ldconfig/aux-cache - something about a mismatch between device major/minor + inode in cache and on disk. I couldn't make sense of this any further at the time.

plot1
@jepio I added the systemd-analyze plot to see the proper picture with ldconfig starting after switch-root. I do not understand why this condition is also present in the ldconfig.service definition: Before=shutdown.target initrd-switch-root.target.

Added a png for the systemd plot, in case the svg does not work.
plot1

chewi commented

I doubt this is memory-related. I just saw this on a local 2GB QEMU instance. It seems random, it worked many other times.

I think the /lib/ld.so.conf warning is unrelated. I had previously considered this an odd location to put this. I thought about suggesting /usr/etc/ld.so.conf instead as the bfd linker uses this by default when building, falling back to /etc/ld.so.confg.