Fix detached pthread failures
vtikoo opened this issue · 7 comments
PR #403 is disabling these two tests:
/tests/basic/pthread_detach
/tests/languages/nodejs
These are disabled due to issues in detached pthreads support.
mmap11 LTP test hang may be related to this.
/tests/languages/nodejs
The failing nodejs detached thread is https://github.com/nodejs/node/blob/1be7bbdc3b533759072cfc3d13766511197e3d01/src/inspector_agent.cc#L77-L92. This is created here - https://github.com/nodejs/node/blob/1be7bbdc3b533759072cfc3d13766511197e3d01/src/inspector_agent.cc#L121.
Pretty much the first thing the thread does is a futex wait syscall, and in doing so segfaults.
inline void* StartIoThreadMain(void* unused) {
for (;;) {
uv_sem_wait(&start_io_thread_semaphore);
...
Depending on whether LKL_DEBUG is turned on this fails differently.
With LKL_DEBUG:
This fails at the first LKL_TRACE
in lkl_syscall
for SYS_futex here.
Thread 9 "ENCLAVE" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffa0e9d700 (LWP 3277)]
0x00007fe000504461 in _vprintf (out=0x7fe019414358, fmt=0x7fe0006f57d0 "%.*s", ap=0x7fe019414450)
at /home/vitikoo/code/sgx-lkl/openenclave/enclave/core/printf.c:437
437 if (!(p = _parse_placeholder(p, &ph, ap)))
(gdb) bt
#0 0x00007fe000504461 in _vprintf (out=0x7fe019414358, fmt=0x7fe0006f57d0 "%.*s", ap=0x7fe019414450)
at /home/vitikoo/code/sgx-lkl/openenclave/enclave/core/printf.c:437
#1 0x00007fe00050413f in oe_vsnprintf (str=0x7fe019414470 "", size=256, fmt=0x7fe0006f57d0 "%.*s", ap=0x7fe019414450)
at /home/vitikoo/code/sgx-lkl/openenclave/enclave/core/printf.c:634
#2 0x00007fe000503180 in oe_host_vfprintf (device=0, fmt=0x7fe0006f57d0 "%.*s", ap_=0x7fe019414750)
at /home/vitikoo/code/sgx-lkl/openenclave/enclave/core/hostcalls.c:157
#3 0x00007fe00050363a in oe_host_printf (fmt=0x7fe0006f57d0 "%.*s")
at /home/vitikoo/code/sgx-lkl/openenclave/enclave/core/hostcalls.c:184
#4 0x00007fe0005e7a4e in print (
str=0x7fe000df8da0 "[[ LKL ]] lkl_syscall(): enter (no=98 current=node host0->TIF host0->TIF_SIGPENDING=1)\n",
len=92) at lkl/posix-host.c:104
#5 0x00007fe000084ec7 in lkl_vprintf (
fmt=0x7fe0006bb6b8 "[[ LKL ]] %s(): enter (no=%li current=%s host0->TIF host0->TIF_SIGPENDING=%i)\n",
args=args@entry=0x7fe019414800) at lib/utils.c:181
#6 0x00007fe000084fc9 in lkl_printf (fmt=<optimized out>) at lib/utils.c:193
#7 0x00007fe00008bac4 in lkl_syscall (no=98, params=0x7fe019414970) at arch/lkl/kernel/syscalls.c:144
Failing instruction being -
=> 0x00007fe000504461 <+561>: callq 0x7fe000505a20 <_parse_placeholder>
Without LKL_DEBUG:
The futex wait syscall is actually run, causing the thread to yield. While yielding back to the lthread scheduler it segfaults trying to write a rbp relative address.
Thread 6 "ENCLAVE" received signal SIGSEGV, Segmentation fault.
0x00007fe0005cf715 in __scheduler_self () at /home/vitikoo/code/sgx-lkl/src/include/enclave/lthread_int.h:81
81 __asm__ __volatile__ ("mov %%gs:48,%0" : "=r" (self) );
(gdb) bt
#0 0x00007fe0005cf715 in __scheduler_self () at /home/vitikoo/code/sgx-lkl/src/include/enclave/lthread_int.h:81
#1 0x00007fe0005cf731 in lthread_get_sched () at /home/vitikoo/code/sgx-lkl/src/include/enclave/lthread_int.h:89
#2 0x00007fe0005cff38 in _lthread_yield_cb (lt=0x7fe000df6c40, f=0x7fe0005c5932 <__do_futex_unlock>, arg=0x7fe000ab75b0 <futex_q_lock>) at sched/lthread.c:343
#3 0x00007fe0005c59cb in __do_futex_sleep (fq=0x7fe000df6d68, ts=0x0) at sched/futex.c:177
#4 0x00007fe0005c5a5b in futex_wait (uaddr=0x7fe000df6c20, val=0, ts=0x0) at sched/futex.c:215
#5 0x00007fe0005c5bdb in enclave_futex_timedwait (uaddr=0x7fe000df6c20, val=0, timeout=0x0) at sched/futex.c:276
#6 0x00007fe0005c5c25 in enclave_futex_wait (uaddr=0x7fe000df6c20, val=0) at sched/futex.c:300
#7 0x00007fe0005e7505 in sem_down (sem=0x7fe000df6c20) at lkl/posix-host.c:289
#8 0x00007fe00008a8c3 in __switch_to (prev=<optimized out>, next=<optimized out>) at arch/lkl/kernel/threads.c:194
#9 0x00007fe000424536 in context_switch (rf=<optimized out>, next=<optimized out>, prev=<optimized out>, rq=<optimized out>) at kernel/sched/core.c:3470
#10 __schedule (preempt=<optimized out>) at kernel/sched/core.c:4167
#11 0x00007fe0004247b2 in schedule () at kernel/sched/core.c:4234
#12 0x00007fe0000d87c3 in futex_wait_queue_me (hb=<optimized out>, q=0x7fe00093d000 <init_thread_info>, timeout=0x7fe03fd64000) at kernel/futex.c:2721
#13 0x00007fe0000d9343 in futex_wait (uaddr=0x0, flags=<optimized out>, val=<optimized out>, abs_time=0x0, bitset=<optimized out>) at kernel/futex.c:2827
#14 0xffffffff00080006 in ?? ()
The failing instruction is -
=> 0x00007fe0005cf715 <+13>: mov %rax,-0x8(%rbp)
(gdb) i r rbp
rbp 0x7fe019412000 0x7fe019412000
/tests/basic/pthread_detach
Here sgx-lkl goes into a context switch loop between ksoftirqd ==> swapper ==> idle_host_task [==> ksoftirqd]
, after the last detached pthread has exited.
[[ LKL ]] free_thread_stack(): enter (task=host_clone99 task->TIF_HOST_THREAD=1 task->TIF_SIGPENDING=0 ti=0x7fa03fdc7000 current=ksoftirqd/0)
[[ LKL ]] kill_thread(): enter (task=host_clone99 task->state=128 task->flags=32836ti=0x7fa03fdc7000 ti->flags=1282 ti->TIF_NO_TERMINATION=0 )
[[ LKL ]] thread_join(): enter (tid=140325186146128)
[[ LKL ]] __switch_to(): Cleaning up 0x7fa03fdc7000
[[ LKL ]] tls_set(): enter (key=0x7fa000df7af0 data=0)
[[ LKL ]] thread_exit(): enter
[[ LKL ]] free_thread_stack(): Deallocating 0x7fa03fdc7000
[[ LKL ]] __switch_to(): ksoftirqd/0=>swapper
[[ LKL ]] __switch_to(): calling sem_down on (task=ksoftirqd/0 task->TIF_HOST_THREAD=0 task->TIF_SIGPENDING=0)
[[ LKL ]] __switch_to(): swapper=>idle_host_task
[[ LKL ]] __switch_to(): calling sem_down on (task=swapper task->TIF_HOST_THREAD=0 task->TIF_SIGPENDING=0)
[[ LKL ]] __switch_to(): idle_host_task=>kworker/0:1
[[ LKL ]] __switch_to(): kworker/0:1=>swapper
The app main thread in this case seems to be sleeping on a futex.
(gdb) call lthread_dump_all_threads(false)
[[ SGX-LKL ]] =============================================================
[[ SGX-LKL ]] Stack traces for all lthreads:
[[ SGX-LKL ]] -------------------------------------------------------------
[[ SGX-LKL ]] 1: tid=147 (7fe000dfb4f0) [kernel] ()
[[ SGX-LKL ]] #0: 7fe0005c5ffb in __do_futex_sleep(...)
[[ SGX-LKL ]] #1: 7fe0005c608b in futex_wait(...)
[[ SGX-LKL ]] #2: 7fe0005c620b in enclave_futex_timedwait(...)
[[ SGX-LKL ]] #3: 7fe0005c6255 in enclave_futex_wait(...)
[[ SGX-LKL ]] #4: 7fe0005e7b57 in sem_down(...)
[[ SGX-LKL ]] #5: 7fe00008abd2 in __switch_to(...)
[[ SGX-LKL ]] #6: 7fe000424b66 in __schedule(...)
[[ SGX-LKL ]] #7: 7fe000424de2 in schedule(...)
[[ SGX-LKL ]] #8: 7fe0000a3ea3 in worker_thread(...)
[[ SGX-LKL ]] #9: 7fe0000a9b54 in kthread(...)
[[ SGX-LKL ]] #10: 7fe00008a88d in thread_bootstrap(...)
[[ SGX-LKL ]] #11: 7fe0005cfe60 in _exec(...)
[[ SGX-LKL ]] -------------------------------------------------------------
[[ SGX-LKL ]] 2: tid=146 (7fe000dfb360) [cloned host task 99] (READY)
[[ SGX-LKL ]] #0: 7fe0005c5ffb in __do_futex_sleep(...)
[[ SGX-LKL ]] #1: 7fe0005c608b in futex_wait(...)
[[ SGX-LKL ]] #2: 7fe0005c620b in enclave_futex_timedwait(...)
[[ SGX-LKL ]] #3: 7fe0005c6255 in enclave_futex_wait(...)
[[ SGX-LKL ]] #4: 7fe0005e7b57 in sem_down(...)
[[ SGX-LKL ]] #5: 7fe00008abd2 in __switch_to(...)
[[ SGX-LKL ]] #6: 7fe000424b66 in __schedule(...)
[[ SGX-LKL ]] #7: 7fe000424de2 in schedule(...)
[[ SGX-LKL ]] #8: 7fe0000d8df3 in futex_wait_queue_me(...)
[[ SGX-LKL ]] #9: 7fe0000d9973 in futex_wait(...)
[[ SGX-LKL ]] #10: 7fe0000db79b in do_futex(...)
[[ SGX-LKL ]] #11: 7fe0000dc045 in sys_futex(...)
[[ SGX-LKL ]] #12: 7fe00008bc54 in lkl_syscall(...)
[[ SGX-LKL ]] #13: 7fe000b8fd76 in <unknown>(...)
[[ SGX-LKL ]] #14: 7fe000b8fe90 in <unknown>(...)
[[ SGX-LKL ]] #15: 7fe000b8ff87 in <unknown>(...)
[[ SGX-LKL ]] #16: 7fe000b962ed in <unknown>(...)
[[ SGX-LKL ]] #17: 7fe03ddd18bf in <unknown>(...)
[[ SGX-LKL ]] #18: 7fe000ba70e4 in <unknown>(...)
[[ SGX-LKL ]] #19: 7fe000bab511 in <unknown>(...)
// 27 other cloned host tasks have same stacktrace and state
[[ SGX-LKL ]] -------------------------------------------------------------
[[ SGX-LKL ]] 31: tid=44 (7fe000df8120) [app_name_thread] (APP_MAIN)
[[ SGX-LKL ]] #0: 7fe0005c5ffb in __do_futex_sleep(...)
[[ SGX-LKL ]] #1: 7fe0005c608b in futex_wait(...)
[[ SGX-LKL ]] #2: 7fe0005c620b in enclave_futex_timedwait(...)
[[ SGX-LKL ]] #3: 7fe0005c6255 in enclave_futex_wait(...)
[[ SGX-LKL ]] #4: 7fe0005e7b57 in sem_down(...)
[[ SGX-LKL ]] #5: 7fe00008abd2 in __switch_to(...)
[[ SGX-LKL ]] #6: 7fe000424b66 in __schedule(...)
[[ SGX-LKL ]] #7: 7fe000424de2 in schedule(...)
[[ SGX-LKL ]] #8: 7fe0000d8df3 in futex_wait_queue_me(...)
[[ SGX-LKL ]] #9: 7fe0000d9973 in futex_wait(...)
[[ SGX-LKL ]] #10: 7fe0000db79b in do_futex(...)
[[ SGX-LKL ]] #11: 7fe0000dc045 in sys_futex(...)
[[ SGX-LKL ]] #12: 7fe00008bc54 in lkl_syscall(...)
[[ SGX-LKL ]] #13: 7fe000b8fd76 in <unknown>(...)
[[ SGX-LKL ]] #14: 7fe000b8fe90 in <unknown>(...)
[[ SGX-LKL ]] #15: 7fe000b8ff87 in <unknown>(...)
[[ SGX-LKL ]] #16: 7fe000b9a177 in <unknown>(...)
[[ SGX-LKL ]] #17: 7fe000b95be3 in <unknown>(...)
[[ SGX-LKL ]] #18: 7fe03ddd196d in <unknown>(...)
[[ SGX-LKL ]] #19: 7fe000b3f180 in <unknown>(...)
[[ SGX-LKL ]] #20: 7fe000b3f137 in <unknown>(...)
[[ SGX-LKL ]] #21: 7fe03ddd1776 in <unknown>(...)
[[ SGX-LKL ]] -------------------------------------------------------------
...
[[ SGX-LKL ]] -------------------------------------------------------------
[[ SGX-LKL ]] 56: tid=1 (7fe000df33e0) [sgx-lkl-init] (EXITED)
[[ SGX-LKL ]] #0: 7fe0005cfeb7 in _exec(...)
[[ SGX-LKL ]] =============================================================
@vtikoo This may mean that the LKL termination sequence hasn't been triggered despite the fact that all application threads have exited.
@prp is the LKL termination sequence triggered only by the app main thread exiting?
Hmm, termination should be triggered by any application thread exiting: https://github.com/lsds/lkl/blob/e041aa71e03a142ecef542c005b07d13e2a3b722/arch/lkl/kernel/threads.c#L89
Going by the stack traces above, it looks like the application main thread hasn't exited and is sleeping on a futex. This is causing LKL to have no runnable tasks after the last detached pthread exits.