hanging server
Opened this issue · 12 comments
Overnight the server went into a hanging state, with the same signature as we have observed previously: pserve 6545 at 100% load.
Using the new facility to get traces, I sampled both 6543 and 6545. Interestingly, 6545 doesn't show any thread active in the fishtest Python code. 6543, however, seems to be deadlocked, or at least stuck in a fixed pattern in which all threads in fishtest code are waiting on locks.
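For anyone curious about what that sampling looks like, here is a minimal in-process sketch of the idea (an illustrative assumption, not the actual fishtest tracing facility, which is driven by the pserve pid): snapshot every thread's stack at a fixed interval and count identical traces.

```python
# Minimal sketch (assumption for illustration, not the actual fishtest facility):
# sample the stack of every thread in this process and count identical traces.
import sys
import time
import traceback
from collections import Counter

def sample_stack_traces(num_samples=600, interval=0.5):
    counts = Counter()
    for _ in range(num_samples):
        for frame in sys._current_frames().values():
            # Format the whole stack as one string so identical traces
            # collapse into a single counter entry.
            counts["".join(traceback.format_stack(frame))] += 1
        time.sleep(interval)
    return counts

if __name__ == "__main__":
    # 600 samples spaced 0.5 s apart matches the 300 s window shown below.
    for trace, count in sample_stack_traces().most_common(10):
        print(count)
        print(trace)
```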
===============================================================
Collecting 600 samples for pserve 6543 spaced 0.5 seconds apart
Total time needed 300 seconds
Found pid 16765 for pserve 6543
Showing 10 most frequent samples traces:
127
  File "rundb.py", line 465, in flush_buffers
    self.scavenge(self.run_cache[oldest]["run"])
  File "rundb.py", line 491, in scavenge
    self.set_inactive_task(task_id, run)
  File "rundb.py", line 86, in set_inactive_task
    with self.active_run_lock(str(run_id)):
127
  File "api.py", line 309, in beat
    self.validate_request()
  File "api.py", line 129, in validate_request
    run = self.request.rundb.get_run(run_id)
  File "rundb.py", line 363, in get_run
    with self.run_cache_lock:
127
  File "api.py", line 238, in update_task
    result = self.request.rundb.update_task(
  File "rundb.py", line 1173, in update_task
    return self.sync_update_task(worker_info, run_id, task_id, stats, spsa)
  File "rundb.py", line 1303, in sync_update_task
    self.buffer(run, False)
  File "rundb.py", line 398, in buffer
    with self.run_cache_lock:
127
  File "api.py", line 237, in update_task
    self.validate_request()
  File "api.py", line 129, in validate_request
    run = self.request.rundb.get_run(run_id)
  File "rundb.py", line 363, in get_run
    with self.run_cache_lock:
127
  File "api.py", line 221, in request_task
    result = self.request.rundb.request_task(worker_info)
  File "rundb.py", line 758, in request_task
    return self.sync_request_task(worker_info)
  File "rundb.py", line 1063, in sync_request_task
    self.buffer(run, False)
  File "rundb.py", line 398, in buffer
    with self.run_cache_lock:
I have the feeling 6543 is in a deadlock situation.
(For completeness' sake, the homepage and event log were still loadable, but were totally frozen data-wise, as if cached. The info on the homepage didn't change for the 3 hours it was hung. All individual test pages, as well as the test creation page, perpetually 504'd, per vondele's comments.)
File "rundb.py", line 86, in set_inactive_task with self.active_run_lock(str(run_id)):
set_inactive_task
does not grab itself any new locks after acquiring self.active_run_lock(str(run_id))
which is a reentrant lock.
Is it theoretically possible for one of the threads to deadlock at that point? Curious...
EDIT: I guess yes if self.active_run_lock(str(run_id))
is held elsewhere. The only other place where it is used is update_task()
.
Ok update_task()
grabs self.run_cache_lock
(via buffer()
) while still holding self.active_run_lock(str(run_id))
Not sure if this a smoking gun but it would probably be safer to first release self.active_run_lock(str(run_id))
before acquiring a new lock.
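For readers who haven't looked at rundb.py: self.active_run_lock(str(run_id)) reads like a factory returning one reentrant lock per run. A hedged guess at its shape, purely for illustration (the real implementation may differ):

```python
# Hedged guess at the shape of the per-run lock accessor (illustrative only;
# the real rundb.py implementation may differ): one reentrant lock per run_id,
# created on demand.
import threading
from collections import defaultdict

class RunDbLocksSketch:
    def __init__(self):
        self._run_locks = defaultdict(threading.RLock)
        self._run_locks_guard = threading.Lock()

    def active_run_lock(self, run_id):
        # Guard the dict itself so two threads don't race while creating
        # the same per-run lock.
        with self._run_locks_guard:
            return self._run_locks[run_id]
```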
Not sure if this is related:
[6:55 PM]Dubslow: for vdbergh's benefit, the recent server marginal-overload cost us around 12ish kcores. the log was littered with "stale active task", "Too many failed update attempts", and "Post request to https://tests.stockfishchess.org/api/request_spsa failed". reducing STC+LTC tune TP helped, as did new SPRTs. the symptom that appeared to show the server was recovering was that the two latter messages disappeared, leaving still-incoming "stale active task" messages. so they dont immediately measure server load, but certainly do increase if the server was recently overloaded.
[8:58 PM]Dubslow: there was a lesser server overload about 10-15 minutes ago, i took some action to reduce tune TPs again, and i think it's mostly solved now, getting "stale active tasks" without the other two messages, as before.
[8:58 PM]Dubslow: however, now, individual test pages are all 504ing consistently. homepage and event log still load fine while every single test 504s.
It's not clear to me or vondele but perhaps the various overloads, and resulting stale tasks requiring cleanup, increased the odds of a hang...? (But if so, I would have expected the earlier, worse overload to hang rather than the second lesser one)
(before the first overload, we had roughly 70k cores. between the first and second overload, we never really had more than 62k cores, and kinda bounced between 62k and down to 48k or so at low points. the hang occurred at 51812 cores, around 10-15 minutes after the lesser overload, as stated before.)
(both overloads and resulting stale tasks could be dug out of the event log, should one desire to.)
Ok, I think I understand. When update_task() invokes buffer(), the latter acquires self.run_cache_lock, which may be held by flush_buffers(), which may itself be waiting in set_inactive_task() (reached via scavenge()). This completes the cycle.
The main issue is that both update_task() and flush_buffers() acquire the same locks (self.run_cache_lock and self.active_run_lock) but in different order.
The solution is to protect set_inactive_task() (and related code) with a new lock.
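For anyone unfamiliar with the pattern, this is the classic ABBA lock-ordering deadlock. A stand-alone sketch with stand-in locks (the names mirror rundb.py, but these are plain module-level locks, not the real RunDb attributes):

```python
# Stand-alone illustration of the cycle described above, with stand-in locks.
# Names mirror rundb.py, but these are not the real RunDb attributes.
import threading

run_cache_lock = threading.Lock()
active_run_lock = threading.RLock()  # the per-run lock is reentrant

def update_task_path():
    with active_run_lock:        # update_task holds the per-run lock ...
        with run_cache_lock:     # ... then buffer() wants run_cache_lock
            pass

def flush_buffers_path():
    with run_cache_lock:         # flush_buffers holds run_cache_lock ...
        with active_run_lock:    # ... then scavenge -> set_inactive_task
            pass                 #     wants the per-run lock

# If the two threads interleave so that each has taken its first lock before
# either takes its second, both wait forever: an ABBA deadlock.
threading.Thread(target=update_task_path).start()
threading.Thread(target=flush_buffers_path).start()
```

Giving set_inactive_task() its own dedicated lock, as proposed, means the flush_buffers() side no longer takes the per-run lock while holding run_cache_lock, which removes one edge of the cycle.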
This bug was my fault (introduced in #2020)... Sorry about that.
I guess it would be useful to do an audit of all locks being used in Fishtest and the order in which they are acquired to avoid such cycles in the future.
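One way such an audit could be made mechanical rather than manual is to give each lock a fixed rank and assert at acquisition time that ranks only increase per thread (the usual "lock hierarchy" discipline). A rough sketch; names and ranks are illustrative, and reentrant locks would need a small extension:

```python
# Sketch of a lock wrapper enforcing a global acquisition order: each lock has
# a rank, and a thread may only acquire locks with strictly increasing ranks.
import threading

_held = threading.local()

class OrderedLock:
    def __init__(self, name, rank):
        self.name, self.rank = name, rank
        self._lock = threading.Lock()

    def __enter__(self):
        held = getattr(_held, "ranks", [])
        assert not held or held[-1] < self.rank, (
            f"lock order violation: acquiring {self.name} (rank {self.rank}) "
            f"while holding a lock of rank {held[-1]}"
        )
        self._lock.acquire()
        _held.ranks = held + [self.rank]
        return self

    def __exit__(self, *exc):
        _held.ranks = _held.ranks[:-1]
        self._lock.release()

# e.g. run_cache_lock = OrderedLock("run_cache_lock", 10)
#      per-run locks  = OrderedLock("run_lock[...]", 20)
# A thread taking them in the wrong order fails fast instead of hanging.
```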
Tonight I will make a PR with a fix.
While I'm not excluding that there is a bug in that PR, similar hangs have been observed prior to it. There might thus be similar issues lurking, so you might want to keep an eye open for that.
However, this is the first time we have this kind of trace info, and I'm happy to see it turns out to be useful. We know what we can do if it happens again...
Current RunDb lock acquire/release graph

set_inactive_run
  ACQ run_lock[run_id]
  REL run_lock[run_id]
set_inactive_task
  ACQ run_lock[run_id]
  REL run_lock[run_id]
update_workers_cores
  CALL get_unfinished_runs_id
  CALL get_run
  ACQ run_lock[run_id]
  CALL buffer
  REL run_lock[run_id]
get_run
  ACQ run_cache_lock
  REL run_cache_lock
buffer
  ACQ run_cache_lock
  ACQ run_cache_write_lock
  REL run_cache_write_lock
  REL run_cache_lock
stop
  ACQ run_cache_lock
  REL run_cache_lock
flush_buffers
  ACQ run_cache_lock
  CALL scavenge
  ACQ run_cache_write_lock
  REL run_cache_write_lock
  CALL scavenge
  ACQ run_cache_write_lock
  REL run_cache_write_lock
  REL run_cache_lock
scavenge
  CALL set_inactive_task
get_unfinished_runs_id
  ACQ run_cache_write_lock
  REL run_cache_write_lock
request_task
  ACQ request_task_lock
  CALL sync_request_task
  REL request_task_lock
sync_request_task
  CALL get_unfinished_runs_id
  CALL get_run
  CALL get_unfinished_runs_id
  CALL get_run
  CALL failed_task
  ACQ run_lock[run_id]
  REL run_lock[run_id]
  CALL buffer
update_task
  ACQ run_lock[run_id]
  CALL sync_update_task
  REL run_lock[run_id]
sync_update_task
  CALL get_run
  CALL set_inactive_task
  CALL set_inactive_task
  CALL set_inactive_task
  CALL update_spsa
  CALL stop_run
  CALL buffer
failed_task
  CALL get_run
  CALL set_inactive_task
  CALL buffer
stop_run
  CALL get_run
  CALL set_inactive_run
  CALL purge_run
  CALL buffer
approve_run
  CALL get_run
  CALL buffer
get_params
  CALL get_run
request_spsa
  CALL get_run
purge_run
  CALL buffer
update_spsa
  CALL get_params
Which tool did you use for this?
It is done by hand, no tools.
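If doing this by hand gets tedious, the same information can feed a tiny script: record "held lock A while acquiring lock B" edges taken from a graph like the one above and search for a cycle. A sketch (the edge list is a hand-transcribed excerpt of the graph, not generated output):

```python
# Sketch: detect a lock-ordering cycle from "held A while acquiring B" edges.
from collections import defaultdict

edges = defaultdict(set)
edges["run_lock[run_id]"].add("run_cache_lock")   # update_task -> buffer
edges["run_cache_lock"].add("run_lock[run_id]")   # flush_buffers -> scavenge -> set_inactive_task

def find_cycle(graph):
    """Depth-first search returning one lock-ordering cycle, if any."""
    state = {}  # node -> "visiting" | "done"

    def dfs(node, path):
        state[node] = "visiting"
        for nxt in graph[node]:
            if state.get(nxt) == "visiting":
                return path[path.index(nxt):] + [nxt]
            if nxt not in state:
                cycle = dfs(nxt, path + [nxt])
                if cycle:
                    return cycle
        state[node] = "done"
        return None

    for node in list(graph):
        if node not in state:
            cycle = dfs(node, [node])
            if cycle:
                return cycle
    return None

print(find_cycle(edges))
# -> ['run_lock[run_id]', 'run_cache_lock', 'run_lock[run_id]']
```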