Disservin/fastchess

segfault at match end

Closed this issue · 3 comments

./fast-chess -recover -repeat -games 2 -rounds 100 -tournament gauntlet -pgnout out.pgn -srand $RANDOM -resign movecount=3 score=600 -draw movenumber=34 movecount=8 score=20 -variant standard -concurrency 32 -openings file=book.epd format=epd order=sequential start=10001 -engine name=sf_1 tc=inf depth=6 cmd=./sf_1 dir=. -engine name=sf_2 tc=inf depth=8 cmd=./sf_2 dir=. -each proto=uci option.Threads=1

yields...

... 
Finished game 152 (sf_2 vs sf_1): 1/2-1/2 {Draw by 3-fold repetition}
Score of sf_1 vs sf_2: 6 - 174 - 20  [0.080] 200
Elo difference: -424.28 +/- 75.66, LOS: 0.00 %, DrawRatio: 10.00 %
terminate called without an active exception
Aborted (core dumped)

Doesn't always happen, so it could be a race. The gdb stack trace points to ThreadPool::resize:

terminate called without an active exception

Thread 3 "fast-chess" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff7a37640 (LWP 299322)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737348073024) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737348073024) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737348073024) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737348073024, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7a7f476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7a657f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7e0fb9e in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7e1b20c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff7e1b277 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff7e1aafc in __gxx_personality_v0 () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff7c7ca06 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#10 0x00007ffff7c7d100 in _Unwind_ForcedUnwind () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#11 0x00007ffff7ada446 in __GI___pthread_unwind (buf=<optimized out>) at ./nptl/unwind.c:130
#12 0x00007ffff7acfba7 in __do_cancel () at ../sysdeps/nptl/pthreadP.h:281
#13 sigcancel_handler (sig=32, si=0x7ffff7a367f0, ctx=<optimized out>) at ./nptl/pthread_cancel.c:56
#14 sigcancel_handler (sig=<optimized out>, si=0x7ffff7a367f0, ctx=<optimized out>) at ./nptl/pthread_cancel.c:32
#15 <signal handler called>
#16 __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x55555573a8f0) at ./nptl/futex-internal.c:57
#17 __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x55555573a8f0) at ./nptl/futex-internal.c:87
#18 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x55555573a8f0, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, 
    private=private@entry=0) at ./nptl/futex-internal.c:139
#19 0x00007ffff7ad0a41 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55555573a8a0, cond=0x55555573a8c8) at ./nptl/pthread_cond_wait.c:503
#20 ___pthread_cond_wait (cond=0x55555573a8c8, mutex=0x55555573a8a0) at ./nptl/pthread_cond_wait.c:627
#21 0x00005555555ab849 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<fast_chess::util::ThreadPool::resize(unsigned long)::{lambda()#1}> > >::_M_run() ()
#22 0x00007ffff7e49253 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#23 0x00007ffff7ad1ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#24 0x00007ffff7b63850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

PS: because of the crash, the SF processes are not killed when fast-chess terminates, so while debugging they need to be killed manually from time to time, or they get OOM-killed.

yikes i was too confident yesterday https://github.com/Disservin/fast-chess/actions/runs/9963168688/job/27528590579...
this all seems to be related to https://github.com/Disservin/fast-chess/blob/master/app/src/util/threadpool.hpp#L69
this is quite a hack currently, I need to check whether the issue is resolved with this change:

diff --git a/app/src/util/threadpool.hpp b/app/src/util/threadpool.hpp
index ebeaf00..f383b28 100644
--- a/app/src/util/threadpool.hpp
+++ b/app/src/util/threadpool.hpp
@@ -12,6 +12,10 @@
 #include <type_traits>
 #include <vector>
 
+namespace fast_chess::atomic {
+extern std::atomic_bool stop;
+}  // namespace fast_chess::atomic
+
 namespace fast_chess::util {
 
 class ThreadPool {
@@ -66,11 +70,13 @@ class ThreadPool {
 
         for (auto &worker : workers_) {
             if (worker.joinable()) {
+                if (atomic::stop) {
 #ifdef _WIN64
-                TerminateThread(reinterpret_cast<HANDLE>(worker.native_handle()), 0);
+                    TerminateThread(reinterpret_cast<HANDLE>(worker.native_handle()), 0);
 #else
-                pthread_cancel(worker.native_handle());
+                    pthread_cancel(worker.native_handle());
 #endif
+                }
                 worker.join();
             }
         }

if this doesn't work, i fear i have to rewrite a big part of the engine communication process. the problem is that all syscalls for reading from the engine pipe block, and making them non-blocking was quite the bottleneck in the past. this means that when a ctrl-c or stop occurs, we need to wait for the pending read to finish. what i tried now is to simply kill the threads which were started for handling the match so that we can interrupt this, though the pthread_cancel approach isn't really recommended. all approaches are kinda bad, dunno, i need to think
https://stackoverflow.com/questions/51742179/terminate-thread-c11-blocked-on-read
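
One commonly used alternative to cancelling a thread that is blocked on a pipe read is to wait on two descriptors with poll(): the engine pipe and a separate "wake" pipe that the stop handler writes to. The sketch below is POSIX-only and is not taken from fast-chess; interruptible_read, ReadStatus, engine_fd and wake_fd are hypothetical names, and whether this avoids the bottleneck mentioned above would have to be measured.

```cpp
#include <poll.h>
#include <unistd.h>

#include <cerrno>
#include <cstddef>
#include <string>

enum class ReadStatus { Data, Eof, Interrupted, Error };

// Hypothetical helper: blocks until engine_fd has data (or is closed) or
// until wake_fd becomes readable, whichever happens first.
ReadStatus interruptible_read(int engine_fd, int wake_fd, std::string &out) {
    pollfd fds[2] = {{engine_fd, POLLIN, 0}, {wake_fd, POLLIN, 0}};

    int rc;
    do {
        rc = poll(fds, 2, -1);  // -1: no timeout, sleep until an fd is ready
    } while (rc < 0 && errno == EINTR);
    if (rc < 0) return ReadStatus::Error;

    // A byte on the wake pipe means "stop waiting"; it takes priority.
    if (fds[1].revents & POLLIN) return ReadStatus::Interrupted;

    if (fds[0].revents & (POLLIN | POLLHUP)) {
        char buf[4096];
        ssize_t n = read(engine_fd, buf, sizeof buf);  // should not block now
        if (n < 0) return ReadStatus::Error;
        if (n == 0) return ReadStatus::Eof;  // writer closed the pipe
        out.append(buf, static_cast<std::size_t>(n));
        return ReadStatus::Data;
    }
    return ReadStatus::Error;
}
```

A stop or Ctrl-C handler would then write a single byte to the write end of the wake pipe (write() is async-signal-safe), and each reader thread returns Interrupted and can exit normally, so join() works without pthread_cancel.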

I have tried this diff, but it doesn't seem to let fast-chess terminate properly; it looks like it hangs. Maybe pilot error, will check again later.