hanging server
Opened this issue · 12 comments
Overnight the server went into a hanging state, with the same signature as we have observed previously: pserve 6545 at 100% load.
Using the new facility to get traces, I sampled both 6543 and 6545. Interestingly, 6545 doesn't show any thread active in the fishtest Python code. 6543, however, seems to be deadlocked, or at least stuck in a fixed pattern in which all threads in fishtest code are waiting on locks.
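For anyone curious about what that sampling looks like, here is a minimal in-process sketch of the idea (an illustrative assumption, not the actual fishtest tracing facility, which is driven by the pserve pid): snapshot every thread's stack at a fixed interval and count identical traces.

```python
# Minimal sketch (assumption for illustration, not the actual fishtest facility):
# sample the stack of every thread in this process and count identical traces.
import sys
import time
import traceback
from collections import Counter

def sample_stack_traces(num_samples=600, interval=0.5):
    counts = Counter()
    for _ in range(num_samples):
        for frame in sys._current_frames().values():
            # Format the whole stack as one string so identical traces
            # collapse into a single counter entry.
            counts["".join(traceback.format_stack(frame))] += 1
        time.sleep(interval)
    return counts

if __name__ == "__main__":
    # 600 samples spaced 0.5 s apart matches the 300 s window shown below.
    for trace, count in sample_stack_traces().most_common(10):
        print(count)
        print(trace)
```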
===============================================================
Collecting 600 samples for pserve 6543 spaced 0.5 seconds apart
Total time needed 300 seconds
Found pid 16765 for pserve 6543
Showing 10 most frequent samples traces:
127
  File "rundb.py", line 465, in flush_buffers
    self.scavenge(self.run_cache[oldest]["run"])
  File "rundb.py", line 491, in scavenge
    self.set_inactive_task(task_id, run)
  File "rundb.py", line 86, in set_inactive_task
    with self.active_run_lock(str(run_id)):
127
  File "api.py", line 309, in beat
    self.validate_request()
  File "api.py", line 129, in validate_request
    run = self.request.rundb.get_run(run_id)
  File "rundb.py", line 363, in get_run
    with self.run_cache_lock:
127
  File "api.py", line 238, in update_task
    result = self.request.rundb.update_task(
  File "rundb.py", line 1173, in update_task
    return self.sync_update_task(worker_info, run_id, task_id, stats, spsa)
  File "rundb.py", line 1303, in sync_update_task
    self.buffer(run, False)
  File "rundb.py", line 398, in buffer
    with self.run_cache_lock:
127
  File "api.py", line 237, in update_task
    self.validate_request()
  File "api.py", line 129, in validate_request
    run = self.request.rundb.get_run(run_id)
  File "rundb.py", line 363, in get_run
    with self.run_cache_lock:
127
  File "api.py", line 221, in request_task
    result = self.request.rundb.request_task(worker_info)
  File "rundb.py", line 758, in request_task
    return self.sync_request_task(worker_info)
  File "rundb.py", line 1063, in sync_request_task
    self.buffer(run, False)
  File "rundb.py", line 398, in buffer
    with self.run_cache_lock:
I have the feeling 6543 is in a deadlock situation.
(For completeness' sake, the homepage and event log were still loadable, but were totally frozen data-wise, as if cached. The info on the homepage didn't change for the 3 hours it was hung. All individual test pages, as well as the test creation page, perpetually 504'd, per vondele's comments.)
File "rundb.py", line 86, in set_inactive_task with self.active_run_lock(str(run_id)):
set_inactive_task
does not grab itself any new locks after acquiring self.active_run_lock(str(run_id))
which is a reentrant lock.
Is it theoretically possible for one of the threads to deadlock at that point? Curious...
EDIT: I guess yes if self.active_run_lock(str(run_id))
is held elsewhere. The only other place where it is used is update_task()
.
Ok update_task()
grabs self.run_cache_lock
(via buffer()
) while still holding self.active_run_lock(str(run_id))
Not sure if this a smoking gun but it would probably be safer to first release self.active_run_lock(str(run_id))
before acquiring a new lock.
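For readers who haven't looked at rundb.py: self.active_run_lock(str(run_id)) reads like a factory returning one reentrant lock per run. A hedged guess at its shape, purely for illustration (the real implementation may differ):

```python
# Hedged guess at the shape of the per-run lock accessor (illustrative only;
# the real rundb.py implementation may differ): one reentrant lock per run_id,
# created on demand.
import threading
from collections import defaultdict

class RunDbLocksSketch:
    def __init__(self):
        self._run_locks = defaultdict(threading.RLock)
        self._run_locks_guard = threading.Lock()

    def active_run_lock(self, run_id):
        # Guard the dict itself so two threads don't race while creating
        # the same per-run lock.
        with self._run_locks_guard:
            return self._run_locks[run_id]
```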
Not sure if this is related:
[6:55 PM]Dubslow: for vdbergh's benefit, the recent server marginal-overload cost us around 12ish kcores. the log was littered with "stale active task", "Too many failed update attempts", and "Post request to https://tests.stockfishchess.org/api/request_spsa failed". reducing STC+LTC tune TP helped, as did new SPRTs. the symptom that appeared to show the server was recovering was that the two latter messages disappeared, leaving still-incoming "stale active task" messages. so they dont immediately measure server load, but certainly do increase if the server was recently overloaded.
[8:58 PM]Dubslow: there was a lesser server overload about 10-15 minutes ago, i took some action to reduce tune TPs again, and i think it's mostly solved now, getting "stale active tasks" without the other two messages, as before.
[8:58 PM]Dubslow: however, now, individual test pages are all 504ing consistently. homepage and event log still load fine while every single test 504s.
It's not clear to me or vondele but perhaps the various overloads, and resulting stale tasks requiring cleanup, increased the odds of a hang...? (But if so, I would have expected the earlier, worse overload to hang rather than the second lesser one)
(before the first overload, we had roughly 70k cores. between the first and second overload, we never really had more than 62k cores, and kinda bounced between 62k and down to 48k or so at low points. the hang occurred at 51812 cores, around 10-15 minutes after the lesser overload, as stated before.)
(both overloads and resulting stale tasks could be dug out of the event log, should one desire to.)
Ok, I think I understand. When update_task() invokes buffer(), the latter acquires self.run_cache_lock, which may be held by flush_buffers(), which may itself be waiting in set_inactive_task() (reached via scavenge()). This completes the cycle.
The main issue is that both update_task() and flush_buffers() acquire the same locks (self.run_cache_lock and self.active_run_lock) but in different order.
The solution is to protect set_inactive_task() (and related code) with a new lock.
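For anyone unfamiliar with the pattern, this is the classic ABBA lock-ordering deadlock. A stand-alone sketch with stand-in locks (the names mirror rundb.py, but these are plain module-level locks, not the real RunDb attributes):

```python
# Stand-alone illustration of the cycle described above, with stand-in locks.
# Names mirror rundb.py, but these are not the real RunDb attributes.
import threading

run_cache_lock = threading.Lock()
active_run_lock = threading.RLock()  # the per-run lock is reentrant

def update_task_path():
    with active_run_lock:        # update_task holds the per-run lock ...
        with run_cache_lock:     # ... then buffer() wants run_cache_lock
            pass

def flush_buffers_path():
    with run_cache_lock:         # flush_buffers holds run_cache_lock ...
        with active_run_lock:    # ... then scavenge -> set_inactive_task
            pass                 #     wants the per-run lock

# If the two threads interleave so that each has taken its first lock before
# either takes its second, both wait forever: an ABBA deadlock.
threading.Thread(target=update_task_path).start()
threading.Thread(target=flush_buffers_path).start()
```

Giving set_inactive_task() its own dedicated lock, as proposed, means the flush_buffers() side no longer takes the per-run lock while holding run_cache_lock, which removes one edge of the cycle.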
This bug was my fault (introduced in #2020)... Sorry about that.
I guess it would be useful to do an audit of all locks being used in Fishtest and the order in which they are acquired to avoid such cycles in the future.
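One way such an audit could be made mechanical rather than manual is to give each lock a fixed rank and assert at acquisition time that ranks only increase per thread (the usual "lock hierarchy" discipline). A rough sketch; names and ranks are illustrative, and reentrant locks would need a small extension:

```python
# Sketch of a lock wrapper enforcing a global acquisition order: each lock has
# a rank, and a thread may only acquire locks with strictly increasing ranks.
import threading

_held = threading.local()

class OrderedLock:
    def __init__(self, name, rank):
        self.name, self.rank = name, rank
        self._lock = threading.Lock()

    def __enter__(self):
        held = getattr(_held, "ranks", [])
        assert not held or held[-1] < self.rank, (
            f"lock order violation: acquiring {self.name} (rank {self.rank}) "
            f"while holding a lock of rank {held[-1]}"
        )
        self._lock.acquire()
        _held.ranks = held + [self.rank]
        return self

    def __exit__(self, *exc):
        _held.ranks = _held.ranks[:-1]
        self._lock.release()

# e.g. run_cache_lock = OrderedLock("run_cache_lock", 10)
#      per-run locks  = OrderedLock("run_lock[...]", 20)
# A thread taking them in the wrong order fails fast instead of hanging.
```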
Tonight I will make a PR with a fix.
While I'm not excluding that there is a bug in that PR, similar hangs have been observed prior to it. There might thus be similar issues lurking, so you might want to keep an eye open for that.
However, this is the first time we have this kind of trace info, and I'm happy to see it turns out to be useful. We know what we can do if it happens again...
Current RunDb lock acquire/release graph

set_inactive_run
  ACQ run_lock[run_id]
  REL run_lock[run_id]
set_inactive_task
  ACQ run_lock[run_id]
  REL run_lock[run_id]
update_workers_cores
  CALL get_unfinished_runs_id
  CALL get_run
  ACQ run_lock[run_id]
  CALL buffer
  REL run_lock[run_id]
get_run
  ACQ run_cache_lock
  REL run_cache_lock
buffer
  ACQ run_cache_lock
  ACQ run_cache_write_lock
  REL run_cache_write_lock
  REL run_cache_lock
stop
  ACQ run_cache_lock
  REL run_cache_lock
flush_buffers
  ACQ run_cache_lock
  CALL scavenge
  ACQ run_cache_write_lock
  REL run_cache_write_lock
  CALL scavenge
  ACQ run_cache_write_lock
  REL run_cache_write_lock
  REL run_cache_lock
scavenge
  CALL set_inactive_task
get_unfinished_runs_id
  ACQ run_cache_write_lock
  REL run_cache_write_lock
request_task
  ACQ request_task_lock
  CALL sync_request_task
  REL request_task_lock
sync_request_task
  CALL get_unfinished_runs_id
  CALL get_run
  CALL get_unfinished_runs_id
  CALL get_run
  CALL failed_task
  ACQ run_lock[run_id]
  REL run_lock[run_id]
  CALL buffer
update_task
  ACQ run_lock[run_id]
  CALL sync_update_task
  REL run_lock[run_id]
sync_update_task
  CALL get_run
  CALL set_inactive_task
  CALL set_inactive_task
  CALL set_inactive_task
  CALL update_spsa
  CALL stop_run
  CALL buffer
failed_task
  CALL get_run
  CALL set_inactive_task
  CALL buffer
stop_run
  CALL get_run
  CALL set_inactive_run
  CALL purge_run
  CALL buffer
approve_run
  CALL get_run
  CALL buffer
get_params
  CALL get_run
request_spsa
  CALL get_run
purge_run
  CALL buffer
update_spsa
  CALL get_params
Which tool did you use for this?
It is done by hand, no tools.
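If doing this by hand gets tedious, the same information can feed a tiny script: record "held lock A while acquiring lock B" edges taken from a graph like the one above and search for a cycle. A sketch (the edge list is a hand-transcribed excerpt of the graph, not generated output):

```python
# Sketch: detect a lock-ordering cycle from "held A while acquiring B" edges.
from collections import defaultdict

edges = defaultdict(set)
edges["run_lock[run_id]"].add("run_cache_lock")   # update_task -> buffer
edges["run_cache_lock"].add("run_lock[run_id]")   # flush_buffers -> scavenge -> set_inactive_task

def find_cycle(graph):
    """Depth-first search returning one lock-ordering cycle, if any."""
    state = {}  # node -> "visiting" | "done"

    def dfs(node, path):
        state[node] = "visiting"
        for nxt in graph[node]:
            if state.get(nxt) == "visiting":
                return path[path.index(nxt):] + [nxt]
            if nxt not in state:
                cycle = dfs(nxt, path + [nxt])
                if cycle:
                    return cycle
        state[node] = "done"
        return None

    for node in list(graph):
        if node not in state:
            cycle = dfs(node, [node])
            if cycle:
                return cycle
    return None

print(find_cycle(edges))
# -> ['run_lock[run_id]', 'run_cache_lock', 'run_lock[run_id]']
```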