Segfault during SGX-LKL shutdown
prp opened this issue · 8 comments
I observed the following failure during a CI run of kernel-syscalls-mmap-mmap11-(nonrelease)-(run-hw)-(8-ethreads)
:
[ SGX-LKL ] Warning: SGXLKL_TRACE_SIGNAL ignored in non-debug mode.
[ SGX-LKL ] nproc=4 ETHREADS=8 CMDLINE="mem=512m"
[ SGX-LKL ] HW TLS support: econf->fsgsbase=1
[ SGX-LKL ] Registering disk 0 (path='sgxlkl-miniroot-fs.img', mnt='/', [RW])
[ SGX-LKL ] No tap device specified, networking will not be available.
[ SGX-LKL ] get_signed_libsgxlkl_path... result=/opt/sgx-lkl-nonrelease/bin/../lib/libsgxlkl.so.signed
[ SGX-LKL ] oe_create_enclave...
[ SGX-LKL ] Enclave config: {"args":["/ltp/testcases/kernel/syscalls/mmap/mmap11"],"clock_res":[{"resolution":"0000000000000001"},{"resolution":"0000000000000001"},{"resolution":"0000000000000000"},{"resolution":"0000000000000000"},{"resolution":"0000000000000001"},{"resolution":"00000000003d0900"},{"resolution":"00000000003d0900"},{"resolution":"0000000000000001"}],"cwd":"/","env":[],"esleep":"16000","espins":"500","ethreads":"8","exit_status":"full","format_version":"1","fsgsbase":true,"host_import_env":[],"hostname":"lkl","hostnet":false,"image_sizes":{"num_heap_pages":"262144","num_stack_pages":"1024"},"io":{"block":true,"console":true,"network":true},"kernel_cmd":"mem=512m","kernel_verbose":true,"max_user_threads":"256","mmap_files":"shared","mode":"hw_debug","mounts":[],"net_gw4":"10.0.1.254","net_ip4":"10.0.1.1","net_mask4":"24","oe_heap_pagecount":"8192","root":{"key":"","key_id":null,"overlay":false,"readonly":false,"roothash":null,"roothash_offset":"0"},"stacksize":"524288","swiotlb":true,"sysctl":null,"tap_mtu":"0","verbose":true,"wg":{"ip":"10.0.2.1","key":null,"listen_port":"56002","peers":[]}}
result=0 (OE_OK)
[ SGX-LKL ] sgxlkl_enclave_init(ethread_id=0)
[ 0.000000] Linux version 5.3.0+ (AzDevOps@cisca24ba0002V4) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #1 Thu Aug 6 09:09:59 UTC 2020
[ 0.000000] memblock address range: 0x7f00207c2000 - 0x7f00407c2000
[ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 129280
[ 0.000000] Kernel command line: mem=512m console=hvc0 virtio_mmio.device=256@0x1000000:1
[ 0.000000] Dentry cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
[ 0.000000] Inode-cache hash table entries: 32768 (order: 6, 262144 bytes, linear)
[ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[ 0.000000] Memory available: 516256k/524288k RAM
[ 0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[ 0.000000] NR_IRQS: 4096
[ 0.000000] lkl: irqs initialized
[ 0.000000] clocksource: lkl: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.000000] lkl: time and timers initialized (irq2)
[ 0.000559] printk: console [hvc0] enabled
[ 0.000559] pid_max: default: 4096 minimum: 301
[ 0.000560] Mount-cache hash table entries: 1024 (order: 1, 8192 bytes, linear)
[ 0.000560] Mountpoint-cache hash table entries: 1024 (order: 1, 8192 bytes, linear)
[ 0.002735] devtmpfs: initialized
[ 0.018031] random: get_random_bytes called from 0x00007f000057a455 with crng_init=0
[ 0.018032] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[ 0.018033] futex hash table entries: 16 (order: -4, 384 bytes, linear)
[ 0.018584] xor: automatically using best checksumming function 8regs
[ 0.018586] NET: Registered protocol family 16
[ 0.021872] clocksource: Switched to clocksource lkl
[ 0.022983] NET: Registered protocol family 2
[ 0.022985] tcp_listen_portaddr_hash hash table entries: 256 (order: 0, 4096 bytes, linear)
[ 0.022985] TCP established hash table entries: 4096 (order: 3, 32768 bytes, linear)
[ 0.022986] TCP bind hash table entries: 4096 (order: 3, 32768 bytes, linear)
[ 0.022986] TCP: Hash tables configured (established 4096 bind 4096)
[ 0.023539] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[ 0.023539] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[ 0.023540] NET: Registered protocol family 1
[ 0.023541] virtio-mmio: Registering device virtio-mmio.0 at 0x1000000-0x10000ff, IRQ 1.
[ 0.024095] workingset: timestamp_bits=62 max_order=18 bucket_order=0
[ 0.025208] NET: Registered protocol family 38
[ 0.025208] io scheduler mq-deadline registered
[ 0.025759] io scheduler kyber registered
[ 0.113409] software IO TLB: mapped [mem 0x7f10322c1000-0x7f10362c1000] (64MB)
[ 0.116643] loop: module loaded
[ 0.117200] device-mapper: ioctl: 4.40.0-ioctl (2019-01-18) initialised: dm-devel@redhat.com
[ 0.126234] wireguard: WireGuard 0.0.20191219 loaded. See www.wireguard.com for information.
[ 0.126235] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[ 0.126789] NET: Registered protocol family 10
[ 0.127333] Segment Routing with IPv6
[ 0.127886] sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
[ 0.128946] This architecture does not have kernel memory protection.
[ 0.129500] Run /init as init process
[ 0.131722] random: crng init done
[[ SGX-LKL ]] wg0 has public key W5/XFYlRAoayhmNTBF1ikKfZZapwu2VaDBW1TXPfkB0=
[ 0.133394] virtio_blk virtio1: [vda] 3072000 512-byte logical blocks (1.57 GB/1.46 GiB)
[ 0.135517] EXT4-fs (vda): mounted filesystem with ordered data mode. Opts:
mmap11 1 [1;32mTPASS[0m : test completed.
[ 0.147612] reboot: Restarting system
[ SGX-LKL ] ethread (5: 0) ethread (6: 0) ethread (3: 0) ethread (2: 0) init (0: 0 exit=0) [[ SGX-LKL ]] FAIL: ethread (4: 0) Exception SIGSEGV (page fault) received before LKL initialisation/after LKL shutdown (lt->tid=-1 [(?)] code=5 addr=0x7f00000ef327 opcode=0x3883 ret=0)
2020-08-06T09:36:05.000000Z [(H)ERROR] tid(0x7f10277fe700) | :OE_ENCLAVE_ABORTING [/agent/_work/1/s/openenclave/host/calls.c:oe_call_enclave_function_by_table_id:91]
ethread (7: 19) [ SGX-LKL ] FAIL: sgxlkl_ethread_init() failed (id=7 result=19 (OE_ENCLAVE_ABORTING))
TEST_FAILED EXIT CODE: 1
This happens to one of the ethreads, after 6 of the 8 ethreads have already exited the enclave. Note that the signal handler cannot retrieve the tid anymore.
I suspect that the test here is irrelevant, and this is rather a rare race condition/memory corruption during shutdown. It would be good to observe this for a DEBUG build and get a stack trace. It seems different from the other issues that we have seen.
perhaps of interest. the draft PR for setting PROT_NONE when unmapping memory causes a large number of core and ltp tests to fail on shutdown. Might be related: #720
https://dev.azure.com/sgx-lkl/sgx-lkl/_build/results?buildId=1705&view=results
perhaps of interest. the draft PR for setting PROT_NONE when unmapping memory causes a large number of core and ltp tests to fail on shutdown. Might be related: #720
https://dev.azure.com/sgx-lkl/sgx-lkl/_build/results?buildId=1705&view=results
Interesting. All of the segfaults for #720 happen in SW mode. Do you have a gdb stack trace of the segfaults in #720? It may be easy to fix.
Here's one:
adjtimex01.c:24: PASS: adjtimex() with mode 0x403f
adjtimex01.c:34: PASS: adjtimex() with mode 0x8001
[ 0.145277] reboot: Restarting system
[Thread 0x7fe040f07000 (LWP 1496) exited]
Thread 4 "sgx-lkl-run-oe" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff93aeb700 (LWP 1493)]
0x0000000040018b2b in virtio_process_queue (dev=0x7ffff7ff5008, qidx=1) at host_interface/virtio.c:285
285 while (q->last_avail_idx != q->avail->idx)
(gdb) bt
#0 0x0000000040018b2b in virtio_process_queue (dev=0x7ffff7ff5008, qidx=1) at host_interface/virtio.c:285
#1 0x00000000400169c9 in console_task (arg=0x0) at host_interface/virtio_console.c:326
#2 0x00007ffff72e66db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007ffff700fa3f in clone () from /lib/x86_64-linux-gnu/libc.so.6
I'm checking to see if other tests look similar.
@prp I checked the stack trace for 8 different failing LTP tests. They are all the same.
My guess @prp from looking at a little code is that we have this in console_task:
while (1)
{
vio_host_process_enclave_event(_cfg->dev_id, timeout_ms);
if (vio_host_check_guest_shutdown_evt())
continue;
virtio_process_queue(dev, TX_QUEUE_ID);
}
and I suspect that after the shutdown check, we are starting the shutdown while we are trying to still use the queue. But I don't know this part of the code at all, so that is pure conjecture at this point. I believe you are more familiar with the virtio code and its interactions.
Do you have any insights?
Here's another that might be related:
https://dev.azure.com/sgx-lkl/ff25f828-9f87-48e4-94f0-7449609f7e8f/_apis/build/builds/1723/logs/148
during 'report/TEST-basic-global_vars_test-(nonrelease)-(run-hw)-(8-ethreads)-junit.xml'
2020-08-06T15:53:10.1702624Z [ 0.162465] EXT4-fs (vda): mounted filesystem without journal. Opts:
2020-08-06T15:53:10.1703541Z fopen failed errno(2)
2020-08-06T15:53:10.1703688Z TEST PASSED
2020-08-06T15:53:10.1704359Z [[ SGX-LKL ]] FAIL: Exception SIGSEGV (page fault) received before LKL initialisation/after LKL shutdown (lt->tid=1 [global_vars_test] code=5 addr=0x7f80001058b1 opcode=0x78b ret=0)
2020-08-06T15:53:10.1705040Z 2020-08-06T15:53:10.000000Z [(H)ERROR] tid(0x7f9c5d435700) | :OE_ENCLAVE_ABORTING [/agent/_work/1/s/openenclave/host/calls.c:oe_call_enclave_function_by_table_id:91]
2020-08-06T15:53:10.1706352Z [ SGX-LKL ] ethread (6: 19) [ SGX-LKL ] FAIL: sgxlkl_ethread_init() failed (id=6 result=19 (OE_ENCLAVE_ABORTING))
2020-08-06T15:53:10.1706651Z [ 0.185198] reboot: Restarting system
2020-08-06T15:53:10.1707734Z Makefile:30: recipe for target 'run-hw' failed
2020-08-06T15:53:10.1708090Z make: *** [run-hw] Error 1
2020-08-06T15:53:10.1708540Z make run-hw: FAILED WITH EXIT CODE: 2
2020-08-06T15:53:10.1708730Z Test run completed with EXIT CODE 2
2020-08-06T15:53:10.1710290Z -----------stdout-end-------------