canonical/lxd

`tests/vm-nesting` regression in lxd-ci

Closed this issue ยท 4 comments

tl;dr

vm-nesting (latest/edge - 24.04) hangs forever in the ==> Test 5 containers each with one nested VM part, but the same works with 5.21/edge.

Description

In lxd-ci daily runs tests/vm-nesting on latest/edge with 24.04 hangs for so long that is eventually times out. It hangs in the ==> Test 5 containers each with one nested VM part of the test as visible in those GHA logs with the highlight copied here:

2024-10-10T06:53:54.3188191Z ==> Test 5 containers each with one nested VM
2024-10-10T06:53:54.3190887Z + echo '==> Test 5 containers each with one nested VM'
2024-10-10T06:53:54.3191523Z + init 5
2024-10-10T06:53:54.3191856Z + vm=
2024-10-10T06:53:54.3192229Z + '[' -z '' ']'
2024-10-10T06:53:54.3193013Z + parallel 5 lxc init ubuntu-minimal-daily:24.04 't{}' -s default -n lxdbr0
2024-10-10T06:53:54.3194105Z + seq 5
2024-10-10T06:53:54.3199135Z + xargs -P 5 -I '{}' lxc init ubuntu-minimal-daily:24.04 't{}' -s default -n lxdbr0
2024-10-10T06:53:54.3894764Z Creating t4
2024-10-10T06:53:54.3896637Z Creating t2
2024-10-10T06:53:54.3936926Z Creating t5
2024-10-10T06:53:54.3989310Z Creating t3
2024-10-10T06:53:54.3996361Z Creating t1
...
2024-10-10T06:54:03.0405684Z + conf 5 security.devlxd.images=true
...
2024-10-10T06:55:15.3946994Z + cmd 5 'lxc launch ubuntu-minimal-daily:24.04 nested --vm -c limits.memory=510MiB'
2024-10-10T06:55:15.3948619Z + parallel 5 lxc exec 't{}' -- bash -c 'lxc launch ubuntu-minimal-daily:24.04 nested --vm -c limits.memory=510MiB'
2024-10-10T06:55:15.3951772Z + seq 5
2024-10-10T06:55:15.3952891Z + xargs -P 5 -I '{}' lxc exec 't{}' -- bash -c 'lxc launch ubuntu-minimal-daily:24.04 nested --vm -c limits.memory=510MiB'
2024-10-10T06:55:15.5815446Z Launching nested
2024-10-10T06:55:15.6170940Z Launching nested
2024-10-10T06:55:15.6206285Z Launching nested
2024-10-10T06:55:15.6257561Z Launching nested
2024-10-10T06:55:15.6602152Z Launching nested
2024-10-10T06:55:17.6666206Z 
2024-10-10T06:55:17.6713282Z Retrieving image: 55.28MB (55.28MB/s)
2024-10-10T06:55:17.6713992Z Retrieving image: 45.93MB (45.92MB/s)
2024-10-10T06:55:17.6836113Z Retrieving image: 61.02MB (60.91MB/s)
2024-10-10T06:55:17.6903960Z Retrieving image: 60.56MB (60.56MB/s)
2024-10-10T06:55:18.6685119Z Retrieving image: 54.76MB (54.76MB/s)
2024-10-10T06:55:18.6710195Z Retrieving image: 103.01MB (51.50MB/s)
2024-10-10T06:55:18.6721355Z Retrieving image: 102.04MB (51.01MB/s)
2024-10-10T06:55:18.6842711Z Retrieving image: 100.77MB (50.34MB/s)
2024-10-10T06:55:18.6905112Z Retrieving image: 108.95MB (54.47MB/s)
2024-10-10T06:55:19.6668565Z Retrieving image: 102.60MB (51.29MB/s)
...
2024-10-10T06:55:22.1518001Z Retrieving image: 227.42MB (45.12MB/s)
2024-10-10T06:55:22.1998891Z Retrieving image: Unpacking image: 100% (1.03GB/s)
2024-10-10T06:55:22.2895334Z Retrieving image: Unpacking image: 100% (822.36MB/s)
2024-10-10T06:55:22.2898206Z Retrieving image: Unpacking image: 100% (1.08GB/s)
2024-10-10T06:55:22.3163226Z Retrieving image: Unpacking image: 100% (1.08GB/s)
2024-10-10T06:55:23.0126800Z Retrieving image: Unpacking image: 100% (1.25GB/s)
2024-10-10T06:55:23.0133647Z                                                    
2024-10-10T06:55:23.0238021Z Retrieving image: Unpacking image: 1%
...
2024-10-10T06:55:34.0655321Z                                                      
2024-10-10T06:55:34.3042505Z Retrieving image: Unpacking image: 99%
2024-10-10T06:55:34.3048067Z                                                
2024-10-10T12:43:22.1280441Z ##[error]The operation was canceled.
...

Notice the ~6h delay between the 2 last time stamps.

The tests/vm-nesting code was changed a few days ago but that only affects the sub test ==> Test ${nestedVMs} VMs each with one nested VM which should be unrelated to the "container running VM failure" described above. Since 5.21/edge on the same OS (24.04) has no issue, it points to a regression in LXD.

@mihalicyn would you mind taking a look at this one?

As I see container (latest/edge - 24.04) also fails and I was able to reproduce this locally.

Reproducer (on Ubuntu Noble):

lxc launch ubuntu-minimal-daily:22.04 n1 -c security.nesting=true -c security.devlxd.images=true
lxc exec n1 -- snap install lxd --channel=latest/edge
lxc exec n1 -- lxd init --auto
lxc exec n1 -- lxc launch ubuntu-minimal-daily:22.04 n11

More info about error:

root@n1:~# lxc start n11
Error: Failed to run: /snap/lxd/current/bin/lxd forkstart n11 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/n11/lxc.conf: exit status 1
Try `lxc info --show-log n11` for more info
root@n1:~# lxc info --show-log n11
Name: n11
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2024/10/11 14:18 UTC
Last Used: 2024/10/11 14:29 UTC

Log:

lxc n11 20241011142945.265 ERROR    cgroup2_devices - ../src/src/lxc/cgroups/cgroup2_devices.c:bpf_program_load_kernel:332 - Operation not permitted - Failed to load bpf program: (null)
lxc n11 20241011142945.332 ERROR    conf - ../src/src/lxc/conf.c:lxc_pivot_root:1401 - Permission denied - Failed to pivot into new root directory "/var/snap/lxd/common/lxc/"
lxc n11 20241011142945.332 ERROR    conf - ../src/src/lxc/conf.c:lxc_setup:3986 - Failed to pivot root into rootfs
lxc n11 20241011142945.332 ERROR    start - ../src/src/lxc/start.c:do_start:1273 - Failed to setup container "n11"
lxc n11 20241011142945.332 ERROR    sync - ../src/src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc n11 20241011142945.332 WARN     network - ../src/src/lxc/network.c:lxc_delete_network_priv:3674 - Failed to rename interface with index 0 from "eth0" to its initial name "veth912989af"
lxc n11 20241011142945.332 ERROR    lxccontainer - ../src/src/lxc/lxccontainer.c:wait_on_daemonized_start:837 - Received container state "ABORTING" instead of "RUNNING"
lxc n11 20241011142945.333 ERROR    start - ../src/src/lxc/start.c:__lxc_start:2114 - Failed to spawn container "n11"
lxc n11 20241011142945.333 WARN     start - ../src/src/lxc/start.c:lxc_abort:1037 - No such process - Failed to send SIGKILL via pidfd 17 for process 2289
lxc 20241011142945.431 ERROR    af_unix - ../src/src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20241011142945.431 ERROR    commands - ../src/src/lxc/commands.c:lxc_cmd_rsp_recv_fds:128 - Failed to receive file descriptors for command "get_init_pid"

and:

[ 2018.960790] audit: type=1400 audit(1728658069.488:428): apparmor="DENIED" operation="pivotroot" class="mount" namespace="root//lxd-n1_<var-snap-lxd-common-lxd>" profile="snap.lxd.daemon" name="/var/snap/lxd/common/lxc/" pid=8183 comm="lxd" srcname="/var/snap/lxd/common/lxc/"

so it's AppArmor-related. I can guess that it's one of the forms of https://bugs.launchpad.net/apparmor/+bug/2067900

at the same time, when I do lxc exec n1 -- snap install lxd instead of snap install lxd --channel=latest/edge it works.
Which points that it's logically connected to our switch to unconfined profiles canonical/lxd-pkg-snap#490

OS info:

root@ubuntu:/home/ubuntu# uname -a
Linux ubuntu 6.8.0-45-generic #45-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 30 12:02:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
root@ubuntu:/home/ubuntu# cat /etc/os-release 
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"

I can confirm that the following kernel patch:

$ git diff
diff --git a/security/apparmor/mount.c b/security/apparmor/mount.c
index 74b7293ab971..b12e6bdfefb2 100644
--- a/security/apparmor/mount.c
+++ b/security/apparmor/mount.c
@@ -678,7 +678,7 @@ static struct aa_label *build_pivotroot(const struct cred *subj_cred,
        AA_BUG(!new_path);
        AA_BUG(!old_path);
 
-       if (!RULE_MEDIATES(rules, AA_CLASS_MOUNT))
+       if (profile_unconfined(profile) || !RULE_MEDIATES(rules, AA_CLASS_MOUNT))
                return aa_get_newest_label(&profile->label);
 
        error = aa_path_name(old_path, path_flags(profile, old_path),

fixes the problem with nested containers. That's the same I found earlier in scope of https://bugs.launchpad.net/apparmor/+bug/2067900