lsds/sgx-lkl

Fix detached pthread failures

vtikoo opened this issue · 7 comments

PR #403 is disabling these two tests:
/tests/basic/pthread_detach
/tests/languages/nodejs

These are disabled due to issues in detached pthreads support.

mmap11 LTP test hang may be related to this.

/tests/languages/nodejs

The failing nodejs detached thread is https://github.com/nodejs/node/blob/1be7bbdc3b533759072cfc3d13766511197e3d01/src/inspector_agent.cc#L77-L92. This is created here - https://github.com/nodejs/node/blob/1be7bbdc3b533759072cfc3d13766511197e3d01/src/inspector_agent.cc#L121.

Pretty much the first thing the thread does is a futex wait syscall, and in doing so segfaults.

inline void* StartIoThreadMain(void* unused) {
  for (;;) {
    uv_sem_wait(&start_io_thread_semaphore);
  ...

Depending on whether LKL_DEBUG is turned on this fails differently.

With LKL_DEBUG:

This fails at the first LKL_TRACE in lkl_syscall for SYS_futex here.

Thread 9 "ENCLAVE" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffa0e9d700 (LWP 3277)]
0x00007fe000504461 in _vprintf (out=0x7fe019414358, fmt=0x7fe0006f57d0 "%.*s", ap=0x7fe019414450)
    at /home/vitikoo/code/sgx-lkl/openenclave/enclave/core/printf.c:437
437                 if (!(p = _parse_placeholder(p, &ph, ap)))
(gdb) bt
#0  0x00007fe000504461 in _vprintf (out=0x7fe019414358, fmt=0x7fe0006f57d0 "%.*s", ap=0x7fe019414450)
    at /home/vitikoo/code/sgx-lkl/openenclave/enclave/core/printf.c:437
#1  0x00007fe00050413f in oe_vsnprintf (str=0x7fe019414470 "", size=256, fmt=0x7fe0006f57d0 "%.*s", ap=0x7fe019414450)
    at /home/vitikoo/code/sgx-lkl/openenclave/enclave/core/printf.c:634
#2  0x00007fe000503180 in oe_host_vfprintf (device=0, fmt=0x7fe0006f57d0 "%.*s", ap_=0x7fe019414750)
    at /home/vitikoo/code/sgx-lkl/openenclave/enclave/core/hostcalls.c:157
#3  0x00007fe00050363a in oe_host_printf (fmt=0x7fe0006f57d0 "%.*s")
    at /home/vitikoo/code/sgx-lkl/openenclave/enclave/core/hostcalls.c:184
#4  0x00007fe0005e7a4e in print (
    str=0x7fe000df8da0 "[[    LKL   ]] lkl_syscall(): enter (no=98 current=node host0->TIF host0->TIF_SIGPENDING=1)\n", 
    len=92) at lkl/posix-host.c:104
#5  0x00007fe000084ec7 in lkl_vprintf (
    fmt=0x7fe0006bb6b8 "[[    LKL   ]] %s(): enter (no=%li current=%s host0->TIF host0->TIF_SIGPENDING=%i)\n", 
    args=args@entry=0x7fe019414800) at lib/utils.c:181
#6  0x00007fe000084fc9 in lkl_printf (fmt=<optimized out>) at lib/utils.c:193
#7  0x00007fe00008bac4 in lkl_syscall (no=98, params=0x7fe019414970) at arch/lkl/kernel/syscalls.c:144

Failing instruction being -

=> 0x00007fe000504461 <+561>:   callq  0x7fe000505a20 <_parse_placeholder>

Without LKL_DEBUG:

The futex wait syscall is actually run, causing the thread to yield. While yielding back to the lthread scheduler it segfaults trying to write a rbp relative address.

Thread 6 "ENCLAVE" received signal SIGSEGV, Segmentation fault.
0x00007fe0005cf715 in __scheduler_self () at /home/vitikoo/code/sgx-lkl/src/include/enclave/lthread_int.h:81
81              __asm__ __volatile__ ("mov %%gs:48,%0" : "=r" (self) );
(gdb) bt
#0  0x00007fe0005cf715 in __scheduler_self () at /home/vitikoo/code/sgx-lkl/src/include/enclave/lthread_int.h:81
#1  0x00007fe0005cf731 in lthread_get_sched () at /home/vitikoo/code/sgx-lkl/src/include/enclave/lthread_int.h:89
#2  0x00007fe0005cff38 in _lthread_yield_cb (lt=0x7fe000df6c40, f=0x7fe0005c5932 <__do_futex_unlock>, arg=0x7fe000ab75b0 <futex_q_lock>) at sched/lthread.c:343
#3  0x00007fe0005c59cb in __do_futex_sleep (fq=0x7fe000df6d68, ts=0x0) at sched/futex.c:177
#4  0x00007fe0005c5a5b in futex_wait (uaddr=0x7fe000df6c20, val=0, ts=0x0) at sched/futex.c:215
#5  0x00007fe0005c5bdb in enclave_futex_timedwait (uaddr=0x7fe000df6c20, val=0, timeout=0x0) at sched/futex.c:276
#6  0x00007fe0005c5c25 in enclave_futex_wait (uaddr=0x7fe000df6c20, val=0) at sched/futex.c:300
#7  0x00007fe0005e7505 in sem_down (sem=0x7fe000df6c20) at lkl/posix-host.c:289
#8  0x00007fe00008a8c3 in __switch_to (prev=<optimized out>, next=<optimized out>) at arch/lkl/kernel/threads.c:194
#9  0x00007fe000424536 in context_switch (rf=<optimized out>, next=<optimized out>, prev=<optimized out>, rq=<optimized out>) at kernel/sched/core.c:3470
#10 __schedule (preempt=<optimized out>) at kernel/sched/core.c:4167
#11 0x00007fe0004247b2 in schedule () at kernel/sched/core.c:4234
#12 0x00007fe0000d87c3 in futex_wait_queue_me (hb=<optimized out>, q=0x7fe00093d000 <init_thread_info>, timeout=0x7fe03fd64000) at kernel/futex.c:2721
#13 0x00007fe0000d9343 in futex_wait (uaddr=0x0, flags=<optimized out>, val=<optimized out>, abs_time=0x0, bitset=<optimized out>) at kernel/futex.c:2827
#14 0xffffffff00080006 in ?? ()

The failing instruction is -

=> 0x00007fe0005cf715 <+13>:    mov    %rax,-0x8(%rbp)
(gdb) i r rbp
rbp            0x7fe019412000   0x7fe019412000

/tests/basic/pthread_detach

Here sgx-lkl goes into a context switch loop between ksoftirqd ==> swapper ==> idle_host_task [==> ksoftirqd], after the last detached pthread has exited.

[[    LKL   ]] free_thread_stack(): enter (task=host_clone99 task->TIF_HOST_THREAD=1 task->TIF_SIGPENDING=0 ti=0x7fa03fdc7000 current=ksoftirqd/0)
[[    LKL   ]] kill_thread(): enter (task=host_clone99 task->state=128 task->flags=32836ti=0x7fa03fdc7000 ti->flags=1282 ti->TIF_NO_TERMINATION=0 )
[[    LKL   ]] thread_join(): enter (tid=140325186146128)
[[    LKL   ]] __switch_to(): Cleaning up 0x7fa03fdc7000
[[    LKL   ]] tls_set(): enter (key=0x7fa000df7af0 data=0)
[[    LKL   ]] thread_exit(): enter
[[    LKL   ]] free_thread_stack(): Deallocating 0x7fa03fdc7000
[[    LKL   ]] __switch_to(): ksoftirqd/0=>swapper
[[    LKL   ]] __switch_to(): calling sem_down on (task=ksoftirqd/0 task->TIF_HOST_THREAD=0 task->TIF_SIGPENDING=0)
[[    LKL   ]] __switch_to(): swapper=>idle_host_task
[[    LKL   ]] __switch_to(): calling sem_down on (task=swapper task->TIF_HOST_THREAD=0 task->TIF_SIGPENDING=0)
[[    LKL   ]] __switch_to(): idle_host_task=>kworker/0:1
[[    LKL   ]] __switch_to(): kworker/0:1=>swapper

The app main thread in this case seems to be sleeping on a futex.

(gdb) call lthread_dump_all_threads(false)
[[  SGX-LKL ]] =============================================================
[[  SGX-LKL ]] Stack traces for all lthreads:
[[  SGX-LKL ]] -------------------------------------------------------------
[[  SGX-LKL ]] 1: tid=147 (7fe000dfb4f0) [kernel] ()
[[  SGX-LKL ]]     #0: 7fe0005c5ffb in __do_futex_sleep(...)
[[  SGX-LKL ]]     #1: 7fe0005c608b in futex_wait(...)
[[  SGX-LKL ]]     #2: 7fe0005c620b in enclave_futex_timedwait(...)
[[  SGX-LKL ]]     #3: 7fe0005c6255 in enclave_futex_wait(...)
[[  SGX-LKL ]]     #4: 7fe0005e7b57 in sem_down(...)
[[  SGX-LKL ]]     #5: 7fe00008abd2 in __switch_to(...)
[[  SGX-LKL ]]     #6: 7fe000424b66 in __schedule(...)
[[  SGX-LKL ]]     #7: 7fe000424de2 in schedule(...)
[[  SGX-LKL ]]     #8: 7fe0000a3ea3 in worker_thread(...)
[[  SGX-LKL ]]     #9: 7fe0000a9b54 in kthread(...)
[[  SGX-LKL ]]     #10: 7fe00008a88d in thread_bootstrap(...)
[[  SGX-LKL ]]     #11: 7fe0005cfe60 in _exec(...)
[[  SGX-LKL ]] -------------------------------------------------------------
[[  SGX-LKL ]] 2: tid=146 (7fe000dfb360) [cloned host task 99] (READY)
[[  SGX-LKL ]]     #0: 7fe0005c5ffb in __do_futex_sleep(...)
[[  SGX-LKL ]]     #1: 7fe0005c608b in futex_wait(...)
[[  SGX-LKL ]]     #2: 7fe0005c620b in enclave_futex_timedwait(...)
[[  SGX-LKL ]]     #3: 7fe0005c6255 in enclave_futex_wait(...)
[[  SGX-LKL ]]     #4: 7fe0005e7b57 in sem_down(...)
[[  SGX-LKL ]]     #5: 7fe00008abd2 in __switch_to(...)
[[  SGX-LKL ]]     #6: 7fe000424b66 in __schedule(...)
[[  SGX-LKL ]]     #7: 7fe000424de2 in schedule(...)
[[  SGX-LKL ]]     #8: 7fe0000d8df3 in futex_wait_queue_me(...)
[[  SGX-LKL ]]     #9: 7fe0000d9973 in futex_wait(...)
[[  SGX-LKL ]]     #10: 7fe0000db79b in do_futex(...)
[[  SGX-LKL ]]     #11: 7fe0000dc045 in sys_futex(...)
[[  SGX-LKL ]]     #12: 7fe00008bc54 in lkl_syscall(...)
[[  SGX-LKL ]]     #13: 7fe000b8fd76 in <unknown>(...)
[[  SGX-LKL ]]     #14: 7fe000b8fe90 in <unknown>(...)
[[  SGX-LKL ]]     #15: 7fe000b8ff87 in <unknown>(...)
[[  SGX-LKL ]]     #16: 7fe000b962ed in <unknown>(...)
[[  SGX-LKL ]]     #17: 7fe03ddd18bf in <unknown>(...)
[[  SGX-LKL ]]     #18: 7fe000ba70e4 in <unknown>(...)
[[  SGX-LKL ]]     #19: 7fe000bab511 in <unknown>(...)

// 27 other cloned host tasks have same stacktrace and state

[[  SGX-LKL ]] -------------------------------------------------------------
[[  SGX-LKL ]] 31: tid=44 (7fe000df8120) [app_name_thread] (APP_MAIN)
[[  SGX-LKL ]]     #0: 7fe0005c5ffb in __do_futex_sleep(...)
[[  SGX-LKL ]]     #1: 7fe0005c608b in futex_wait(...)
[[  SGX-LKL ]]     #2: 7fe0005c620b in enclave_futex_timedwait(...)
[[  SGX-LKL ]]     #3: 7fe0005c6255 in enclave_futex_wait(...)
[[  SGX-LKL ]]     #4: 7fe0005e7b57 in sem_down(...)
[[  SGX-LKL ]]     #5: 7fe00008abd2 in __switch_to(...)
[[  SGX-LKL ]]     #6: 7fe000424b66 in __schedule(...)
[[  SGX-LKL ]]     #7: 7fe000424de2 in schedule(...)
[[  SGX-LKL ]]     #8: 7fe0000d8df3 in futex_wait_queue_me(...)
[[  SGX-LKL ]]     #9: 7fe0000d9973 in futex_wait(...)
[[  SGX-LKL ]]     #10: 7fe0000db79b in do_futex(...)
[[  SGX-LKL ]]     #11: 7fe0000dc045 in sys_futex(...)
[[  SGX-LKL ]]     #12: 7fe00008bc54 in lkl_syscall(...)
[[  SGX-LKL ]]     #13: 7fe000b8fd76 in <unknown>(...)
[[  SGX-LKL ]]     #14: 7fe000b8fe90 in <unknown>(...)
[[  SGX-LKL ]]     #15: 7fe000b8ff87 in <unknown>(...)
[[  SGX-LKL ]]     #16: 7fe000b9a177 in <unknown>(...)
[[  SGX-LKL ]]     #17: 7fe000b95be3 in <unknown>(...)
[[  SGX-LKL ]]     #18: 7fe03ddd196d in <unknown>(...)
[[  SGX-LKL ]]     #19: 7fe000b3f180 in <unknown>(...)
[[  SGX-LKL ]]     #20: 7fe000b3f137 in <unknown>(...)
[[  SGX-LKL ]]     #21: 7fe03ddd1776 in <unknown>(...)
[[  SGX-LKL ]] -------------------------------------------------------------
...
[[  SGX-LKL ]] -------------------------------------------------------------
[[  SGX-LKL ]] 56: tid=1 (7fe000df33e0) [sgx-lkl-init] (EXITED)
[[  SGX-LKL ]]     #0: 7fe0005cfeb7 in _exec(...)
[[  SGX-LKL ]] =============================================================
prp commented

@vtikoo This may mean that the LKL termination sequence hasn't been triggered despite the fact that all application threads have exited.

@prp is the LKL termination sequence triggered only by the app main thread exiting?

prp commented

Hmm, termination should be triggered by any application thread exiting: https://github.com/lsds/lkl/blob/e041aa71e03a142ecef542c005b07d13e2a3b722/arch/lkl/kernel/threads.c#L89

Going by the stack traces above, it looks like the application main thread hasn't exited and is sleeping on a futex. This is causing LKL to have no runnable tasks after the last detached pthread exits.