lsds/sgx-lkl

Segfault during SGX-LKL shutdown

prp opened this issue · 8 comments

prp commented

I observed the following failure during a CI run of kernel-syscalls-mmap-mmap11-(nonrelease)-(run-hw)-(8-ethreads):

[   SGX-LKL  ] Warning: SGXLKL_TRACE_SIGNAL ignored in non-debug mode.
[   SGX-LKL  ] nproc=4 ETHREADS=8 CMDLINE="mem=512m"
[   SGX-LKL  ] HW TLS support: econf->fsgsbase=1
[   SGX-LKL  ] Registering disk 0 (path='sgxlkl-miniroot-fs.img', mnt='/', [RW])
[   SGX-LKL  ] No tap device specified, networking will not be available.
[   SGX-LKL  ] get_signed_libsgxlkl_path... result=/opt/sgx-lkl-nonrelease/bin/../lib/libsgxlkl.so.signed
[   SGX-LKL  ] oe_create_enclave...
[   SGX-LKL  ] Enclave config: {"args":["/ltp/testcases/kernel/syscalls/mmap/mmap11"],"clock_res":[{"resolution":"0000000000000001"},{"resolution":"0000000000000001"},{"resolution":"0000000000000000"},{"resolution":"0000000000000000"},{"resolution":"0000000000000001"},{"resolution":"00000000003d0900"},{"resolution":"00000000003d0900"},{"resolution":"0000000000000001"}],"cwd":"/","env":[],"esleep":"16000","espins":"500","ethreads":"8","exit_status":"full","format_version":"1","fsgsbase":true,"host_import_env":[],"hostname":"lkl","hostnet":false,"image_sizes":{"num_heap_pages":"262144","num_stack_pages":"1024"},"io":{"block":true,"console":true,"network":true},"kernel_cmd":"mem=512m","kernel_verbose":true,"max_user_threads":"256","mmap_files":"shared","mode":"hw_debug","mounts":[],"net_gw4":"10.0.1.254","net_ip4":"10.0.1.1","net_mask4":"24","oe_heap_pagecount":"8192","root":{"key":"","key_id":null,"overlay":false,"readonly":false,"roothash":null,"roothash_offset":"0"},"stacksize":"524288","swiotlb":true,"sysctl":null,"tap_mtu":"0","verbose":true,"wg":{"ip":"10.0.2.1","key":null,"listen_port":"56002","peers":[]}}
result=0 (OE_OK)
[   SGX-LKL  ] sgxlkl_enclave_init(ethread_id=0)
[    0.000000] Linux version 5.3.0+ (AzDevOps@cisca24ba0002V4) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #1 Thu Aug 6 09:09:59 UTC 2020
[    0.000000] memblock address range: 0x7f00207c2000 - 0x7f00407c2000
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 129280
[    0.000000] Kernel command line: mem=512m console=hvc0 virtio_mmio.device=256@0x1000000:1
[    0.000000] Dentry cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
[    0.000000] Inode-cache hash table entries: 32768 (order: 6, 262144 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory available: 516256k/524288k RAM
[    0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] NR_IRQS: 4096
[    0.000000] lkl: irqs initialized
[    0.000000] clocksource: lkl: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.000000] lkl: time and timers initialized (irq2)
[    0.000559] printk: console [hvc0] enabled
[    0.000559] pid_max: default: 4096 minimum: 301
[    0.000560] Mount-cache hash table entries: 1024 (order: 1, 8192 bytes, linear)
[    0.000560] Mountpoint-cache hash table entries: 1024 (order: 1, 8192 bytes, linear)
[    0.002735] devtmpfs: initialized
[    0.018031] random: get_random_bytes called from 0x00007f000057a455 with crng_init=0
[    0.018032] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    0.018033] futex hash table entries: 16 (order: -4, 384 bytes, linear)
[    0.018584] xor: automatically using best checksumming function   8regs     
[    0.018586] NET: Registered protocol family 16
[    0.021872] clocksource: Switched to clocksource lkl
[    0.022983] NET: Registered protocol family 2
[    0.022985] tcp_listen_portaddr_hash hash table entries: 256 (order: 0, 4096 bytes, linear)
[    0.022985] TCP established hash table entries: 4096 (order: 3, 32768 bytes, linear)
[    0.022986] TCP bind hash table entries: 4096 (order: 3, 32768 bytes, linear)
[    0.022986] TCP: Hash tables configured (established 4096 bind 4096)
[    0.023539] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.023539] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.023540] NET: Registered protocol family 1
[    0.023541] virtio-mmio: Registering device virtio-mmio.0 at 0x1000000-0x10000ff, IRQ 1.
[    0.024095] workingset: timestamp_bits=62 max_order=18 bucket_order=0
[    0.025208] NET: Registered protocol family 38
[    0.025208] io scheduler mq-deadline registered
[    0.025759] io scheduler kyber registered
[    0.113409] software IO TLB: mapped [mem 0x7f10322c1000-0x7f10362c1000] (64MB)
[    0.116643] loop: module loaded
[    0.117200] device-mapper: ioctl: 4.40.0-ioctl (2019-01-18) initialised: dm-devel@redhat.com
[    0.126234] wireguard: WireGuard 0.0.20191219 loaded. See www.wireguard.com for information.
[    0.126235] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[    0.126789] NET: Registered protocol family 10
[    0.127333] Segment Routing with IPv6
[    0.127886] sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
[    0.128946] This architecture does not have kernel memory protection.
[    0.129500] Run /init as init process
[    0.131722] random: crng init done
[[  SGX-LKL ]] wg0 has public key W5/XFYlRAoayhmNTBF1ikKfZZapwu2VaDBW1TXPfkB0=
[    0.133394] virtio_blk virtio1: [vda] 3072000 512-byte logical blocks (1.57 GB/1.46 GiB)
[    0.135517] EXT4-fs (vda): mounted filesystem with ordered data mode. Opts: 
mmap11      1  [1;32mTPASS[0m  :  test completed.
[    0.147612] reboot: Restarting system
[   SGX-LKL  ] ethread (5: 0) ethread (6: 0) ethread (3: 0) ethread (2: 0) init (0: 0 exit=0) [[  SGX-LKL ]] FAIL: ethread (4: 0) Exception SIGSEGV (page fault) received before LKL initialisation/after LKL shutdown (lt->tid=-1 [(?)] code=5 addr=0x7f00000ef327 opcode=0x3883 ret=0)
2020-08-06T09:36:05.000000Z [(H)ERROR] tid(0x7f10277fe700) | :OE_ENCLAVE_ABORTING [/agent/_work/1/s/openenclave/host/calls.c:oe_call_enclave_function_by_table_id:91]
ethread (7: 19) [   SGX-LKL  ] FAIL: sgxlkl_ethread_init() failed (id=7 result=19 (OE_ENCLAVE_ABORTING))
TEST_FAILED EXIT CODE: 1

This happens to one of the ethreads, after 6 of the 8 ethreads have already exited the enclave. Note that the signal handler cannot retrieve the tid anymore.

I suspect that the test here is irrelevant, and this is rather a rare race condition/memory corruption during shutdown. It would be good to observe this for a DEBUG build and get a stack trace. It seems different from the other issues that we have seen.

perhaps of interest. the draft PR for setting PROT_NONE when unmapping memory causes a large number of core and ltp tests to fail on shutdown. Might be related: #720

https://dev.azure.com/sgx-lkl/sgx-lkl/_build/results?buildId=1705&view=results

prp commented

perhaps of interest. the draft PR for setting PROT_NONE when unmapping memory causes a large number of core and ltp tests to fail on shutdown. Might be related: #720

https://dev.azure.com/sgx-lkl/sgx-lkl/_build/results?buildId=1705&view=results

Interesting. All of the segfaults for #720 happen in SW mode. Do you have a gdb stack trace of the segfaults in #720? It may be easy to fix.

Here's one:

adjtimex01.c:24: PASS: adjtimex() with mode 0x403f
adjtimex01.c:34: PASS: adjtimex() with mode 0x8001
[    0.145277] reboot: Restarting system
[Thread 0x7fe040f07000 (LWP 1496) exited]

Thread 4 "sgx-lkl-run-oe" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff93aeb700 (LWP 1493)]
0x0000000040018b2b in virtio_process_queue (dev=0x7ffff7ff5008, qidx=1) at host_interface/virtio.c:285
285         while (q->last_avail_idx != q->avail->idx)
(gdb) bt
#0  0x0000000040018b2b in virtio_process_queue (dev=0x7ffff7ff5008, qidx=1) at host_interface/virtio.c:285
#1  0x00000000400169c9 in console_task (arg=0x0) at host_interface/virtio_console.c:326
#2  0x00007ffff72e66db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007ffff700fa3f in clone () from /lib/x86_64-linux-gnu/libc.so.6

I'm checking to see if other tests look similar.

@prp I checked the stack trace for 8 different failing LTP tests. They are all the same.

My guess @prp from looking at a little code is that we have this in console_task:

    while (1)
    {
        vio_host_process_enclave_event(_cfg->dev_id, timeout_ms);

        if (vio_host_check_guest_shutdown_evt())
            continue;

        virtio_process_queue(dev, TX_QUEUE_ID);
    }

and I suspect that after the shutdown check, we are starting the shutdown while we are trying to still use the queue. But I don't know this part of the code at all, so that is pure conjecture at this point. I believe you are more familiar with the virtio code and its interactions.

Do you have any insights?

prp commented

I think you're right. By the way, this appears a duplicate of #158.

Here's another that might be related:

https://dev.azure.com/sgx-lkl/ff25f828-9f87-48e4-94f0-7449609f7e8f/_apis/build/builds/1723/logs/148

during 'report/TEST-basic-global_vars_test-(nonrelease)-(run-hw)-(8-ethreads)-junit.xml'

2020-08-06T15:53:10.1702624Z [    0.162465] EXT4-fs (vda): mounted filesystem without journal. Opts: 
2020-08-06T15:53:10.1703541Z fopen failed errno(2)
2020-08-06T15:53:10.1703688Z TEST PASSED
2020-08-06T15:53:10.1704359Z [[  SGX-LKL ]] FAIL: Exception SIGSEGV (page fault) received before LKL initialisation/after LKL shutdown (lt->tid=1 [global_vars_test] code=5 addr=0x7f80001058b1 opcode=0x78b ret=0)
2020-08-06T15:53:10.1705040Z 2020-08-06T15:53:10.000000Z [(H)ERROR] tid(0x7f9c5d435700) | :OE_ENCLAVE_ABORTING [/agent/_work/1/s/openenclave/host/calls.c:oe_call_enclave_function_by_table_id:91]
2020-08-06T15:53:10.1706352Z [   SGX-LKL  ] ethread (6: 19) [   SGX-LKL  ] FAIL: sgxlkl_ethread_init() failed (id=6 result=19 (OE_ENCLAVE_ABORTING))
2020-08-06T15:53:10.1706651Z [    0.185198] reboot: Restarting system
2020-08-06T15:53:10.1707734Z Makefile:30: recipe for target 'run-hw' failed
2020-08-06T15:53:10.1708090Z make: *** [run-hw] Error 1
2020-08-06T15:53:10.1708540Z make run-hw: FAILED WITH EXIT CODE: 2
2020-08-06T15:53:10.1708730Z Test run completed with EXIT CODE 2
2020-08-06T15:53:10.1710290Z -----------stdout-end-------------
prp commented

This may get fixed by PR #770.