posix: pthread_cancel test fails intermittently on RISC-V Linux
64 opened this issue · 4 comments
On commit 3459cb4, the CI encountered an intermittent failure in the pthread_cancel
test on RISC-V (under the Linux sysdeps):
61/144 mlibc:posix / pthread_cancel FAIL 1.10s killed by signal 6 SIGABRT
――――――――――――――――――――――――――――――――――――― ✀ ―――――――――――――――――――――――――――――――――――――
stderr:
In function main, file ../../../src/mlibc/tests/posix/pthread_cancel.c:94: Assertion '!ret' failed!
I haven't been able to reproduce this anywhere, so it's possible that it is just a qemu-user
bug or toolchain bug. It's also possible that the bug is present in the arch/OS-independent code too.
As discussed on discord, the suspected cause is the following sequence of events:
- Thread 2 goes to sleep for a second.
- Thread 1 cancels thread 2, by first setting tcbCancelTriggerBit in tcb->cancelBits, then, if cancellation is enabled, sending a SIGCANCEL to the thread.
- Thread 2 wakes up after tcbCancelTriggerBit was set, but before the signal was sent.
- Thread 2 calls sleep again, sees that cancellation was requested, and exits via __mlibc_do_cancel.
- Thread 1 finally gets around to sending SIGCANCEL, but it's too late, as thread 2 has quit due to the cancellation request.
- Thread 1's call to sys_tgkill fails, and pthread_cancel erroneously forwards the error code.
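The interleaving above can be replayed deterministically in a small single-threaded model. This is only a sketch: the struct, the bit values, and the helper names (model_thread, model_tgkill, cancel_set_trigger, cancellation_point, replay_race) are hypothetical stand-ins rather than mlibc's real definitions, and the ESRCH error code is an assumption, since the failing test only shows that pthread_cancel returned non-zero.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified model; not the real mlibc definitions. */
enum { cancelEnableBit = 1 << 0, cancelTriggerBit = 1 << 1 };

struct model_thread {
    int cancelBits;
    bool alive; /* false once the thread has exited */
};

/* Stand-in for sys_tgkill: assumed to fail with ESRCH if the target is gone. */
static int model_tgkill(const struct model_thread *t) {
    return t->alive ? 0 : ESRCH;
}

/* Thread 1, first half of cancel: set the trigger bit and report
 * whether a signal still has to be sent. */
static bool cancel_set_trigger(struct model_thread *t) {
    t->cancelBits |= cancelTriggerBit;
    return (t->cancelBits & cancelEnableBit) != 0;
}

/* Thread 2 reaching a cancellation point (the second sleep call):
 * it sees the trigger bit and exits, modelling __mlibc_do_cancel. */
static void cancellation_point(struct model_thread *t) {
    if ((t->cancelBits & cancelEnableBit) && (t->cancelBits & cancelTriggerBit))
        t->alive = false;
}

/* Replays the suspected interleaving and returns what the unfixed
 * cancel path would hand back to the caller. */
static int replay_race(void) {
    struct model_thread t2 = { cancelEnableBit, true };

    bool need_signal = cancel_set_trigger(&t2); /* thread 1: bit is set */
    assert(need_signal);                        /* cancellation is enabled here */
    cancellation_point(&t2);                    /* thread 2 runs in the gap and exits */
    return need_signal ? model_tgkill(&t2) : 0; /* thread 1: signal arrives too late */
}
```

In this model replay_race returns the error from the signal send even though the cancellation request was honoured, which is the situation that would trip the '!ret' assertion in the test.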
The solution would be to check for tcb->cancelBits & tcbExitingBit if sys_tgkill fails, and if it's set, ignore the error.
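A minimal sketch of that check, assuming a simplified TCB and a callback in place of sys_tgkill so the logic can be exercised in isolation. The bit values, sketch_tcb, send_signal_fn, signal_ok, signal_esrch, cancel_sketch, and self_check are all hypothetical names for illustration; the real mlibc code is C++ and is structured differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical bit values and TCB layout; the real mlibc definitions differ. */
enum {
    tcbCancelEnableBit  = 1 << 0,
    tcbCancelTriggerBit = 1 << 1,
    tcbExitingBit       = 1 << 2,
};

struct sketch_tcb {
    atomic_int cancelBits;
};

/* Stand-in for sys_tgkill: returns 0 or an errno-style error code. */
typedef int (*send_signal_fn)(struct sketch_tcb *);

static int signal_ok(struct sketch_tcb *t) { (void)t; return 0; }
static int signal_esrch(struct sketch_tcb *t) { (void)t; return ESRCH; }

/* Sketch of the cancel path with the proposed check added. */
static int cancel_sketch(struct sketch_tcb *t, send_signal_fn send_sigcancel) {
    int old = atomic_fetch_or(&t->cancelBits, tcbCancelTriggerBit);
    if (!(old & tcbCancelEnableBit))
        return 0; /* cancellation disabled: no signal is sent */

    int e = send_sigcancel(t);
    if (e && (atomic_load(&t->cancelBits) & tcbExitingBit))
        return 0; /* target is already exiting, so the failure is ignored */
    return e;     /* success, or a failure on a thread that is not exiting */
}

/* Exercises the three cases that matter for the fix. */
static void self_check(void) {
    struct sketch_tcb exiting = { tcbCancelEnableBit | tcbExitingBit };
    struct sketch_tcb running = { tcbCancelEnableBit };

    assert(cancel_sketch(&exiting, signal_esrch) == 0);     /* error swallowed */
    assert(cancel_sketch(&running, signal_esrch) == ESRCH); /* error forwarded */
    assert(cancel_sketch(&running, signal_ok) == 0);        /* normal path */
}
```

The error is only dropped when the exiting bit is observed after the failed send, so a failure against a thread that is still running is reported as before.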
To reproduce, run yes | parallel strace -e trace=tgkill -e status=failed qemu-riscv64 tests/posix-pthread_cancel
on an Ubuntu 20.04 machine. Watch out: the output will be spammed by useless exit codes, since Ubuntu 20.04 is too old for strace's quiet option, but we've got to match CI.
Tange, O. (2022, May 22). GNU Parallel 20220522 ('NATO').
Zenodo. https://doi.org/10.5281/zenodo.6570228
Also reproducible on AArch64: https://github.com/managarm/mlibc/runs/7482261270