official-stockfish/fishtest

Document load under fleet

vondele opened this issue · 114 comments

Just to keep some numbers for later reference, this is the average time in ms needed server side when the fleet is hitting (but not overloading) the server (50k cores), as obtained from the worker output (Post request https://tests.stockfishchess.org:443/api/request_task handled in 1719.69ms (server: 265.24ms)):

failed_task 13.6646
request_spsa 4.04172
request_task 149.143
request_version 0.241336
update_task 4.36213
upload_pgn 92.8058

The corresponding observations on the worker side:

failed_task 742.879
request_spsa 319.939
request_task 756.641
request_version 627.497
update_task 539.884
upload_pgn 1213.44

Server-side sampling of the traces, for calls not leading to a lock, looks like:

21
     File "rundb.py", line 432, in flush_buffers self.runs.replace_one(
7
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info)
          File "rundb.py", line 727, in request_task return self.sync_request_task(worker_info)
               File "rundb.py", line 738, in sync_request_task w = self.workerdb.get_worker(my_name)
                    File "workerdb.py", line 17, in get_worker r = self.workers.find_one(
5
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info)
          File "rundb.py", line 727, in request_task return self.sync_request_task(worker_info)
               File "rundb.py", line 902, in sync_request_task committed_games += (
4
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info)
          File "rundb.py", line 727, in request_task return self.sync_request_task(worker_info)
               File "rundb.py", line 870, in sync_request_task if connections >= connections_limit:
3
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info)
          File "rundb.py", line 727, in request_task return self.sync_request_task(worker_info)
               File "rundb.py", line 837, in sync_request_task task_name = worker_name(task["worker_info"], short=True)
                    File "util.py", line 47, in worker_name name = "{}-{}cores".format(username, cores)

Essentially all stack traces go via request_task.

And some API times of the server under load (IIRC 75k cores) but at various stages of 'distress'.
medium:

# logging from [08/May/2024:18:26:11 +0000] to [08/May/2024:18:27:30 +0000] ########
# duration (seconds)            : 79
# calls in total                : 10000
# calls per second              : 126
# calls not reaching the backend: 1
# calls handled by the backend  : 9999
#                         route    calls      total    average    minimum    maximum
                      /api/beat     5523   2100.429      0.380      0.001      4.014
               /api/update_task     3154    723.048      0.229      0.002      4.014
              /api/request_task      376    138.410      0.368      0.038      3.648
           /api/request_version      371     62.196      0.168      0.000      3.232
                   /tests/view/       45     45.010      1.000      0.025      3.406
                /api/upload_pgn      367     19.217      0.052      0.007      0.289
                  /tests/tasks/       18      6.594      0.366      0.045      2.577
                         /tests        9      6.145      0.683      0.029      0.936
              /api/request_spsa       42      3.253      0.077      0.002      0.363
                   /tests/user/        2      1.172      0.586      0.540      0.632
                   /api/actions       28      0.840      0.030      0.003      0.161
                       /api/nn/       59      0.625      0.011      0.003      0.059
                  /contributors        1      0.293      0.293      0.293      0.293

5 request_task / s

high:

# logging from [08/May/2024:18:38:01 +0000] to [08/May/2024:18:39:06 +0000] ########
# duration (seconds)            : 65
# calls in total                : 10000
# calls per second              : 153
# calls not reaching the backend: 1
# calls handled by the backend  : 9999
#                         route    calls      total    average    minimum    maximum
                     /api/beat     4854  12774.715      2.632      0.002     24.684
              /api/update_task     2673   4104.362      1.535      0.002     24.689
          /api/request_version     1711   1581.257      0.924      0.001     23.757
             /api/request_task      583   1036.156      1.777      0.042     24.179
                 /api/get_elo/        7     29.748      4.250      0.108     24.629
                        /tests       10      8.301      0.830      0.040      1.138
             /api/request_spsa       31      7.610      0.245      0.005      0.739
               /api/upload_pgn       57      5.803      0.102      0.008      3.694
              /tests/live_elo/        4      3.639      0.910      0.223      2.038
                  /tests/user/        1      0.517      0.517      0.517      0.517
                  /api/actions       23      0.412      0.018      0.002      0.061
                      /api/nn/       41      0.288      0.007      0.003      0.047
                  /tests/view/        1      0.261      0.261      0.261      0.261
               /tests/finished        1      0.121      0.121      0.121      0.121

9 request_task / s

too much:

# logging from [08/May/2024:18:42:28 +0000] to [08/May/2024:18:43:19 +0000] ########
# duration (seconds)            : 51
# calls in total                : 10000
# calls per second              : 196
# calls not reaching the backend: 
# calls handled by the backend  : 10000
#                         route    calls      total    average    minimum    maximum
                      /api/beat     4662  23688.011      5.081      0.002     30.029
               /api/update_task     4074  20350.370      4.995      0.003     30.031
           /api/request_version      767   4114.809      5.365      0.003     29.990
              /api/request_task      341   1665.680      4.885      0.071     29.987
              /api/request_spsa       66     70.501      1.068      0.010     12.693
               /api/failed_task        9     48.770      5.419      0.012     29.984
                   /tests/view/        4      7.470      1.867      0.175      6.781
                         /tests        9      6.444      0.716      0.022      1.260
                     /tests/run        2      5.032      2.516      1.115      3.917
                /api/upload_pgn       32      1.834      0.057      0.008      0.272
                   /tests/user/        2      1.174      0.587      0.498      0.676
                   /api/actions       22      0.470      0.021      0.007      0.136

7 request_task / s

If we go with the above timing for request_task, we can handle 1000 / 149 ~ 7 of them per second, so that's consistent with the numbers we have here (leaving aside the timeouts at various rates for profiles 2 and 3).

@vondele Can you explain why the internal throttling in Fishtest does nothing (at most five concurrent accesses to request_task)? I had PR #1987 to set the number to 1, but I closed it since I did not understand what was going on.

No, I don't know why that happens. Admittedly, I don't quite know how the frameworks around our Python code do things; if they only allowed one concurrent access to the same endpoint, it would be obvious.

# Limit concurrent request_task
task_lock = threading.Lock()
task_semaphore = threading.Semaphore(4)

def request_task(self, worker_info):
    if self.task_semaphore.acquire(False):
        try:
            with self.task_lock:
                return self.sync_request_task(worker_info)
        finally:
            self.task_semaphore.release()
    else:
        print("request_task too busy", flush=True)
        return {"task_waiting": False}

It should admit only 4 concurrent requests and drop the others until a slot in the semaphore is released...

The entire thing is basically single threaded from the start because of how Python does things. I think Pasquale has worked around that by configuring nginx so that certain endpoints are forwarded to different Python processes (I think we have 2 running)? So all requests to the same endpoint are handled sequentially without any concurrency, iirc.

###
# wsgi server configuration
###
[server:main]
use = egg:waitress#main
host = 127.0.0.1
port = %(http_port)s
trusted_proxy = 127.0.0.1
trusted_proxy_count = 1
trusted_proxy_headers = x-forwarded-for x-forwarded-host x-forwarded-proto x-forwarded-port
clear_untrusted_proxy_headers = yes
# Match connection limit with max number of workers
connection_limit = 100
threads = 4

Each of the 3 instances should have 4 threads, so the semaphore should never be exhausted.

PROD running with 10 threads per instance. I will check the log for rejected requests.

PROD running with 10 threads per instance. I will check the log for rejected requests.

If Disservin is correct then the # of threads is irrelevant because of the GIL.

EDIT: that’s not quite right. Threads can switch between the execution of byte codes.

Let me add some workers to check this; I need to pause the progression tests for that.

import threading
import time

class Test:
    def __init__(self):
        self.task_lock = threading.Lock()
        self.task_semaphore = threading.Semaphore(4)

    def access_resource(self, thread_id):
        if self.task_semaphore.acquire(False):
            try:
                with self.task_lock:
                    print(f"Thread {thread_id} is accessing the resource.")
                    time.sleep(0.001)
            finally:
                self.task_semaphore.release()
                print(f"Thread {thread_id} has released the resource.")
        else:
            print(f"Thread {thread_id} could not acquire the semaphore, resource is too busy.")

test = Test()

threads = []
for i in range(1000):
    thread = threading.Thread(target=test.access_resource, args=(i,))
    threads.append(thread)
    thread.start()

# Wait for all threads to complete
for thread in threads:
    thread.join()
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy

Just FYI, while the workers are ramping up, the initial phase has the following server timings:

request_spsa 11.578
request_task 39.5495
request_version 0.349775
update_task 6.072

2-3min later:

failed_task 32.0517
request_spsa 9.76163
request_task 55.4572
request_version 0.594289
update_task 3.42941

about 9min later:

failed_task 36.864
request_spsa 7.39204
request_task 61.0517
request_version 0.543516
update_task 3.66163
upload_pgn 15.84

22min later:

failed_task 37.8486
request_spsa 9.56024
request_task 72.8793
request_version 0.499205
update_task 10.0572
upload_pgn 23.6664

34min later:

failed_task 38.7616
request_spsa 8.77399
request_task 80.093
request_version 0.464707
update_task 9.20768
upload_pgn 27.9647

It is clear that request_task timing grows quickly with the server load.
A typical worker:

Post request https://tests.stockfishchess.org:443/api/request_task handled in 148.79ms (server: 14.41ms)
Post request https://tests.stockfishchess.org:443/api/request_task handled in 356.46ms (server: 197.66ms)

or

Post request https://tests.stockfishchess.org:443/api/request_task handled in 171.36ms (server: 13.02ms)
Post request https://tests.stockfishchess.org:443/api/request_task handled in 351.89ms (server: 184.89ms)

So in the ramp-up phase, with few workers active, we see about 14ms per call; later this is closer to 190ms per call.

PROD running with 10 threads per instance. I will check the log for rejected requests.

If Disservin is correct then the # of threads is irrelevant because of the GIL.

EDIT: that’s not quite right. Threads can switch between the execution of byte codes.

Well, I only mentioned the bits I've collected; also it's the first time I'm seeing the prod ini with the threads, so take everything from me with a grain of salt.

It's the first proof that waitress threads are for real :)

As we've done the check, I'll take the workers down.

As shown in #2003 (comment), it looks clear that the request_task API is getting increasingly expensive under load, from roughly 15 to 190ms per call between the first and the second call of a worker.

I think the number of connections in waitress should be equal to the number of threads (so that waitress does not queue connections) and then the value of the request_task semaphore should be adjusted so that Fishtest does not become overloaded.

For years we ran the server with connection_limit 500 and 16 threads. I didn't measure any improvement with 2000 machines / 17000 cores in PROD, so I restored the waitress defaults of connection_limit 100 and 4 threads in b960e2b
https://docs.pylonsproject.org/projects/waitress/en/stable/arguments.html

In nginx we keep alive 64 connections for the upstream to avoid the cost of closing and opening connections (not sure the 2 configs are talking about the same "connection"):

upstream backend_6543 {
    server 127.0.0.1:6543;
    keepalive 64;
}

The thing is that 100 is too big.

Waitress will simply queue connections without giving Fishtest a chance to reject them. That's why we are seeing 30s for API calls: they are just queued in waitress.

EDIT: I mean too big compared to the number of worker threads.

We set a 30 second timeout in nginx:

    location / {
        rewrite ^/$       /tests permanent;
        rewrite ^/tests/$ /tests permanent;
        proxy_set_header Connection "";
        proxy_set_header X-Forwarded-Proto  $scheme;
        proxy_set_header X-Forwarded-For    $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Host   $host:$server_port;
        proxy_set_header X-Forwarded-Port   $server_port;
        proxy_set_header X-Country-Code     $region;

        client_max_body_size        120m;
        client_body_buffer_size     128k;
        proxy_connect_timeout       5s;
        proxy_send_timeout          30s;
        proxy_read_timeout          30s;     # timeout
        proxy_max_temp_file_size    0;
        proxy_redirect              off;
        proxy_http_version          1.1;

        proxy_cache cache_zone;

        gunzip on;

        proxy_pass http://$backends;

        location /api/serialize_run/ {
            deny all;
        }
    }

Yes nginx caps the connection time. But either waitress or Fishtest should reject the connections. It should not be left to nginx.

It's really the job of nginx. We can set connection limits and rates per host, uri, binary_remote_addr, globally, or even for any single path.

limit_conn_zone $host zone=perhost:10m;
limit_conn perhost 10;
limit_conn_status 429;

It's really the job of nginx. We can set limits per host, uri, even for any single path.

limit_conn_zone $host zone=perhost:10m;
limit_conn perhost 10;

Well then nginx should limit the number of connections to waitress to the number of threads in Fishtest. Otherwise waitress will just queue the connections.

The point is we need to give Fishtest a chance to reject the request_task connections using the semaphore.

Note that Fishtest will reject the connection immediately. There will be no 30s timeout.

Set 10 connections in nginx (matching the waitress threads), tested by opening 50 connections concurrently.

2024/05/14 00:52:57 [error] 13428#13428: *790 limiting connections by zone "per_host_conn", client: ***.***.***.***, server: tests.stockfishchess.org, request: "GET /api/active_runs HTTP/2.0", host: "tests.stockfishchess.org"
2024/05/14 00:52:57 [alert] 13429#13429: ignore long locked inactive cache entry 522ba7397b573b51acc828305a6d6a06, count:7
2024/05/14 00:53:59 [error] 13428#13428: *1075 limiting connections by zone "per_host_conn", client: ***.***.***.***, server: tests.stockfishchess.org, request: "GET /api/active_runs HTTP/2.0", host: "tests.stockfishchess.org"
2024/05/14 00:53:59 [error] 13428#13428: *1073 challenging request, client: ***.***.***.***, server: tests.stockfishchess.org, request: "GET /api/active_runs HTTP/2.0", host: "tests.stockfishchess.org"

Set 30 connections in nginx and 10 connections in waitress (matching the 10 threads).
IMO the web works with queues to deal with spikes, and we don't want the workers making requests too fast when the server is overloaded...

Commented out the nginx limit.

Set 30 connections in nginx and 10 connections in waitress (matching the 10 threads). IMO the web works with queues to deal with spikes, and we don't want the workers making requests too fast when the server is overloaded...

That is of course true. But one can overdo it, which leads to bufferbloat: https://en.wikipedia.org/wiki/Bufferbloat . Currently we have an nginx queue, a waitress queue, and a queue in Fishtest (of threads waiting for locks to clear), and it becomes hard to understand how they interact.

I think the waitress queue should be eliminated, since nginx is already queuing, so that Fishtest can take control.

Of course I may be wrong, but it seems natural to me that if Fishtest can handle, say, 7 request_task calls per second, it should be given the opportunity to reject more. Now it does not have that opportunity, since it seems to me that most connections are queued by waitress until they are closed by nginx after 30 seconds.

I really think we should understand and fix why Fishtest can only handle so few request_task calls.

If we can go by the profile in #2003 (comment), we see that it spends most of its time in:

  • accessing the workerdb to get worker info (might depend on the worker load), might be hard to avoid?
  • counting committed games by iterating over all tasks (grows with workers) to avoid assigning too many
  • counting connections to see if the limit is reached, by iterating over active_tasks (grows with workers)
  • checking if a worker name is already connected by iterating over active_tasks (grows with workers).

So 3 of the 4 hotspots are checks; can they be written or implemented differently / are they even needed at this point in the code?

If we can't fix request_task, we should probably rate limit it at the nginx level, while at the same time modifying the worker to catch that condition. Not really as an error (i.e. no other messages sent to the server), but just for the worker to go to sleep and retry a while later; i.e. not really an error condition, rather a signal.

I haven't looked at the code yet, but at first glance at the 4 parts of request_task vdv lists:

  • accessing workerdb should be fast
  • presumably an internal count of the number of tasks (games?) could be maintained to avoid the need for iteration?
  • and another internal count could be maintained for connections?
  • is some kind of Python dict (or similar) possible to avoid iteration?

If we can't fix request_task, we should probably rate limit it at the nginx level, while at the same time modifying the worker to catch that condition. Not really as an error (i.e. no other messages sent to the server), but just for the worker to go to sleep and retry a while later; i.e. not really an error condition, rather a signal.

It is ok to return an error since the worker will not report back on a failed request_task.

If we can't fix request_task, we should probably rate limit it at the nginx level

It is still not clear to me why we cannot have fishtest do the rate limiting (i.e. return immediately if too many calls are waiting). Fishtest can return a much clearer error message than nginx.

Currently this does not work because waitress does not pass the connections on to Fishtest but instead queues them forever until they time out in nginx. This is because waitress allows 100 simultaneous connections but only allocates 4 threads to Fishtest.

Of course we could have Fishtest handle this, but nginx is made to deal with it, i.e. it could be overloaded with 10000s of calls and handle it fine. It can also do more subtle things, like allowing bursts of requests for a short while, etc. I'm not too opinionated about where to put that check, but the worker needs to handle it gracefully/smartly.

However, the issue really is that request_task is not fast enough to sustain the load, not how we want to fail in the presence of load. If I'm not mistaken, it is again something that scales quadratically in the number of workers. Basically, the number of requests grows linearly with the number of workers, and for each request one needs to loop over all active tasks, which again grows linearly with the number of workers. Combined, we have the quadratic scaling again. I believe the solution that xoto10 suggests is what is needed: the loops need to be replaced by bookkeeping, somehow.
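
To make the bookkeeping idea concrete, here is a minimal sketch (class and method names are invented for illustration, this is not Fishtest code) of keeping the per-host connection count as a counter that is updated when tasks are assigned and released, instead of recomputing it by iterating over all active tasks on every request_task:

import threading
from collections import Counter

class ConnectionBookkeeping:
    # Tracks active connections per remote host so that request_task can do
    # an O(1) check instead of iterating over all active tasks.
    def __init__(self, connections_limit):
        self._lock = threading.Lock()
        self._counter = Counter()
        self.connections_limit = connections_limit

    def try_acquire(self, remote_addr):
        # Called at the start of request_task.
        with self._lock:
            if self._counter[remote_addr] >= self.connections_limit:
                return False
            self._counter[remote_addr] += 1
            return True

    def release(self, remote_addr):
        # Called on every code path that deactivates a task
        # (finished, failed, dead worker, ...).
        with self._lock:
            self._counter[remote_addr] -= 1
            if self._counter[remote_addr] <= 0:
                del self._counter[remote_addr]

The delicate part is that the counter has to be decremented on every code path that deactivates a task, which is one more argument for funnelling all deactivations through a single helper like the set_inactive discussed further down.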

In principle the count of committed games could easily be maintained dynamically, like what is done for cores and workers:

if task_finished:
    # run["cores"] is also updated in request_task().
    # We use the same lock.
    with self.task_lock:
        run["workers"] -= 1
        run["cores"] -= task["worker_info"]["concurrency"]
        assert run["cores"] >= 0

@ppigazzini Does this assert ever trigger?

If we're unsure whether a maintained count is reliable, or just as an internal sanity check, the code could do the full iteration for 1 in 100 requests and log if the actual count differs from the maintained count.

Not needed with perfect code ofc, but might be useful in real life ...
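
A rough sketch of that idea (the function name is invented; it assumes the maintained counters are run["workers"] and run["cores"] as in the snippet above):

import random

def check_counters(run, sample_rate=0.01):
    # For roughly 1 in 100 requests, recount the active tasks the slow way
    # and log any drift from the dynamically maintained counters.
    if random.random() >= sample_rate:
        return
    active = [t for t in run["tasks"] if t["active"]]
    workers = len(active)
    cores = sum(t["worker_info"]["concurrency"] for t in active)
    if workers != run["workers"] or cores != run["cores"]:
        print(
            "counter drift in run {}: maintained {}/{}, recounted {}/{}".format(
                run.get("_id"), run["workers"], run["cores"], workers, cores
            ),
            flush=True,
        )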

Actually cores and workers are not fully dynamic since they are recomputed in request_task and cached for 1 minute.

It would be good to first try to maintain cores and workers fully dynamically.

For consideration... flush_buffers is called every second, independent of the worker load (IIUC). Something like that (i.e. frequent but worker-count independent) could be used to run checks or initialize cached results, I think.
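
For example, a flush_buffers-style background loop (sketch only; refresh_cached_aggregates is a made-up name for whatever check or cache initialization we want to run), whose cost is once per second regardless of how many workers are connected:

import threading
import time

def start_background_refresh(rundb, interval=1.0):
    # Runs a periodic job whose cost does not grow with the number of workers,
    # e.g. recompute the cached aggregates once and log any drift.
    def loop():
        while True:
            start = time.time()
            try:
                rundb.refresh_cached_aggregates()  # hypothetical method
            except Exception as e:
                print("background refresh failed: {}".format(e), flush=True)
            time.sleep(max(0.0, interval - (time.time() - start)))

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t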

I agree with Fishtest rejecting requests when overloaded. There will always be a limit to how many requests can be processed per second, so the excess should be rejected somehow. At the moment we process, say, 7/s, so if a sustained rate of 10 requests per second comes in, we have to reject 30% of them; there is no choice (never mind 100 or 1000!). The code to do this is right at the start of request_task, so the time taken to do it should be minimal.

If the load is too fast even for this to cope then nginx can queue (and ultimately reject) requests.

I note that INITIAL_RETRY_TIME = 15.0; it seems to me there should be a hierarchy. Perhaps the internal queue(s) of 4 smooth things across a second or two, followed by rejections if overloaded. nginx could time out after 5-10s if Fishtest is very heavily loaded, and the worker retries happen after 15s/30s/minutes?

There is a hierarchy, in that it increases from 15s to 900s doubling on every failure. The pending https://github.com/official-stockfish/fishtest/pull/1983/files even started at 120s, but I might agree this is too much for failures that don't cause any follow-up message to the server (such as request_task).
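
For reference, a sketch of that hierarchy (INITIAL_RETRY_TIME = 15.0 is from worker.py as quoted above; MAX_RETRY_TIME = 900.0 is assumed from the "increases from 15s to 900s" description; the optional jitter mirrors the randomisation proposed later in this thread):

import random

INITIAL_RETRY_TIME = 15.0
MAX_RETRY_TIME = 900.0  # assumed from "increases from 15s to 900s"

def retry_delays(jitter=False):
    # Yields the successive sleep times: 15, 30, 60, ... capped at 900 seconds,
    # optionally spread over [2/3, 4/3] of the nominal value.
    delay = INITIAL_RETRY_TIME
    while True:
        yield random.uniform(delay * 2 / 3, delay * 4 / 3) if jitter else delay
        delay = min(MAX_RETRY_TIME, delay * 2)

Without jitter this gives 15, 30, 60, 120, 240, 480, 900, 900, ... seconds between retries.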

It would be good to first try to maintain cores and workers fully dynamically.

So first and foremost the ubiquitous

task["active"] = False

should be made illegal...

It should be replaced by something like

def set_inactive(self, task_id, run):
    with self.task_lock:
        task = run["tasks"][task_id]
        if not task["active"]:
            return
        run["workers"] -= 1
        run["cores"] -= task["worker_info"]["concurrency"]
        task["active"] = False

EDIT: IIUC, then run["workers"] should always be equal to the number of active tasks in the run. This is something that can be put in the schema. Something similar can be done for run["cores"].

Should it make sure, under the lock, that it only sets active tasks to false and decrements cores?

def set_inactive(self, task_id, run):
    task = run["tasks"][task_id]
    with self.task_lock:
        if task["active"]:
            run["workers"] -= 1
            run["cores"] -= task["worker_info"]["concurrency"]
            task["active"] = False

Should it make sure, under the lock, that it only sets active tasks to false and decrements cores?

Yes, I just corrected that...

@ppigazzini Does this assert ever trigger?

the issue is that the assert is not informative

>>> run = {"cores": -1}
>>> assert run["cores"] >= 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
sudo journalctl -u fishtest@6543 --since "3 months ago" | grep AssertionError
Feb 20 18:24:09 tests.stockfishchess.org pserve[18998]: AssertionError
Feb 22 13:25:48 tests.stockfishchess.org pserve[18998]: AssertionError
Feb 25 00:26:09 tests.stockfishchess.org pserve[22927]: AssertionError
Mar 05 15:09:17 tests.stockfishchess.org pserve[17087]: AssertionError
Mar 13 03:54:25 tests.stockfishchess.org pserve[17087]: AssertionError:
Mar 13 17:18:58 tests.stockfishchess.org pserve[17087]: AssertionError
Mar 16 14:31:51 tests.stockfishchess.org pserve[17087]: AssertionError
Mar 21 16:23:48 tests.stockfishchess.org pserve[17087]: AssertionError:
Mar 21 20:28:27 tests.stockfishchess.org pserve[17087]: AssertionError:
Mar 22 04:46:00 tests.stockfishchess.org pserve[17087]: AssertionError:
Mar 23 20:27:42 tests.stockfishchess.org pserve[17087]: AssertionError:
Mar 25 13:53:30 tests.stockfishchess.org pserve[17087]: AssertionError
Mar 30 13:38:06 tests.stockfishchess.org pserve[17087]: AssertionError
Apr 01 01:26:46 tests.stockfishchess.org pserve[17087]: AssertionError
Apr 02 19:06:21 tests.stockfishchess.org pserve[6917]: AssertionError
Apr 02 19:06:44 tests.stockfishchess.org pserve[6917]: AssertionError
Apr 05 13:11:30 tests.stockfishchess.org pserve[24267]: AssertionError
Apr 07 08:19:45 tests.stockfishchess.org pserve[24267]: AssertionError
Apr 07 08:20:09 tests.stockfishchess.org pserve[24267]: AssertionError
Apr 21 14:26:10 tests.stockfishchess.org pserve[19678]: AssertionError
Apr 25 10:27:50 tests.stockfishchess.org pserve[31771]: AssertionError
Apr 27 11:10:38 tests.stockfishchess.org pserve[21423]: AssertionError
Apr 27 12:45:36 tests.stockfishchess.org pserve[29624]: AssertionError

Yes, our messages crossed, but I think it needs to be inside the lock?

@ppigazzini

$ sudo journalctl -u fishtest@6543 --since "3 months ago" | grep AssertionError -B2
Feb 20 18:24:09 tests.stockfishchess.org pserve[18998]:     assert run["cores"] >= 0
Feb 20 18:24:09 tests.stockfishchess.org pserve[18998]:            ^^^^^^^^^^^^^^^^^
Feb 20 18:24:09 tests.stockfishchess.org pserve[18998]: AssertionError
--
Feb 22 13:25:48 tests.stockfishchess.org pserve[18998]:     assert run["cores"] >= 0
Feb 22 13:25:48 tests.stockfishchess.org pserve[18998]:            ^^^^^^^^^^^^^^^^^
Feb 22 13:25:48 tests.stockfishchess.org pserve[18998]: AssertionError

I guess so.... Changing it again.

$ sudo journalctl -u fishtest@6543 --since "3 months ago" | grep AssertionError -B2
Feb 20 18:24:09 tests.stockfishchess.org pserve[18998]:     assert run["cores"] >= 0
Feb 20 18:24:09 tests.stockfishchess.org pserve[18998]:            ^^^^^^^^^^^^^^^^^
Feb 20 18:24:09 tests.stockfishchess.org pserve[18998]: AssertionError
--
Feb 22 13:25:48 tests.stockfishchess.org pserve[18998]:     assert run["cores"] >= 0
Feb 22 13:25:48 tests.stockfishchess.org pserve[18998]:            ^^^^^^^^^^^^^^^^^
Feb 22 13:25:48 tests.stockfishchess.org pserve[18998]: AssertionError

So this shows that maintaining things dynamically needs careful thought. But it should be good to do that exercise for the sanity of the Fishtest codebase.

@vondele Ah I see you already posted the correct code in your reply. Sorry for not noticing this. I thought you were just quoting my code.

If the load is too fast even for this to cope then nginx can queue (and ultimately reject) requests.

nginx can reject ASAP, even at path level, according to several policies based on rate and/or number of connections.
A request already known to be rejected adds avoidable load to waitress and the Python code.

If the load is too fast even for this to cope then nginx can queue (and ultimately reject) requests.

nginx can reject ASAP, even at path level, according to several policies based on rate and/or number of connections. A request already known to be rejected adds avoidable load to waitress and the Python code.

This is true to some extent. But nginx does not know how many requests Fishtest can handle. Configuring nginx to allow at most 7 request_task/s seems overly constraining since this number may change over time.

I agree with xoto10 that we should have both: a high limit in nginx to not overload Fishtest in extreme conditions (so that it has the time to reject things), and an internal limit in Fishtest specifically for expensive APIs (currently only request_task, it seems).

Is nginx configured to do this now for request_task?

Agree we don't want to use this at 7/sec.

If we use this, I suggest we only use it for very high numbers, maybe 100/sec ??

It would be good to first try to maintain cores and workers fully dynamically.

Next request_task should only acquire the task_lock when actually attaching a newly created task to the run...

I agree with xoto10 that we should have both: a high limit in nginx to not overload Fishtest in extreme conditions (so that it has the time to reject things), and an internal limit in Fishtest specifically for expensive APIs (currently only request_task, it seems).

Read again what xoto10 wrote (and I quoted) about nginx before agreeing :)

If the load is too fast even for this to cope then nginx can queue (and ultimately reject) requests.

I added the information on what is possible to do with nginx.
nginx has long been configured with several limits to pass only a reasonable load to waitress.
An issue is that the workers start sending requests to the Events Log when the original requests are rejected.
The waitress instance at 6545 (Events Log) deadlocked 2 times with the fleet; only an instance restart solved the issue.

An issue is that the workers start sending requests to the Events Log when the original requests are rejected.

I don't understand this bit: is this an event log on the server? Do the workers try to log the rejection back to the server??

An issue is that the workers start sending requests to the Events Log when the original requests are rejected.

I don't understand this bit: is this an event log on the server? Do the workers try to log the rejection back to the server??

https://tests.stockfishchess.org/actions?page=1&user=vdv&action=failed_task

https://tests.stockfishchess.org/actions?action=failed_task&user=vdv&text=update

When tests with many cores finish, there will be a burst of many workers (possibly 100-300?) requesting a new job within a few seconds. Would it make sense to add some randomisation to the sleep time? (worker.py line 1605)

            else:
+               delay = random.uniform(delay * 2 / 3, delay * 4 / 3)
                print("Waiting {} seconds before retrying".format(delay))
                safe_sleep(delay)
                delay = min(MAX_RETRY_TIME, delay * 2)

I'm not sure, but I don't think this happens if the server returns the task_waiting (i.e. server too busy) result. Another reason to return a result rather than allowing a timeout.

When tests with many cores finish, there will be a burst of many workers (possibly 100-300?) requesting a new job within a

During the last tests, a finished run released ~14k cores, aka ~2k workers

I'm not sure, but I don't think this happens if the server returns the task_waiting (i.e. server too busy) result. Another reason to return a result rather than allowing a timeout.

Check the links to the Events Log; the worker reports any issue (for example hitting the GitHub API cap, a missed task update, a badly configured toolchain, etc.).
With a big fleet, when the primary instance at 6543 is overloaded, or when nginx starts dropping requests, the secondary instance at 6545 is flooded.

The worker should be updated to deal with the status returned by nginx as well (once we agree on the final nginx, waitress, python setup, of course).
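
A rough sketch of what that could look like on the worker side (this is not the actual worker code; it assumes nginx answers with 429 as in the limit_conn_status 429 config earlier in the thread, and that a 429 is treated exactly like the server's own {"task_waiting": False} answer, i.e. back off without reporting anything to the Events Log):

import time
import requests

def request_task_with_backoff(payload, delay=15.0, max_delay=900.0):
    while True:
        try:
            response = requests.post(
                "https://tests.stockfishchess.org/api/request_task",
                json=payload,
                timeout=30,
            )
        except requests.RequestException:
            response = None
        if response is not None and response.status_code == 200:
            result = response.json()
            if result.get("task_waiting", True):
                return result
            # Server said it is too busy: fall through to the sleep below.
        elif response is not None and response.status_code == 429:
            # nginx rate limit hit: a signal to slow down, not an error to report.
            pass
        print("Waiting {} seconds before retrying".format(delay), flush=True)
        time.sleep(delay)
        delay = min(max_delay, delay * 2)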

I added the information on what is possible to do with nginx.
nginx has long been configured with several limits to pass only a reasonable load to waitress.

Well as I said, one issue is that if there are too many workers, then waitress is overloaded, since it has too few worker threads (only 4).

An issue is that the workers start sending requests to the Events Log when the original requests are rejected.
The waitress instance at 6545 (Events Log) deadlocked 2 times with the fleet; only an instance restart solved the issue.

This is not an issue with request_task (which we are currently discussing). After a rejected request_task the worker will simply go to sleep and retry again.

During the last tests, a finished run released ~14k cores, aka ~2k workers

I am not sure if there is a scenario in which after a finished run all the 2k workers start posting to the event log (that would be a bug). Normally there should just be a single finished_run event.

The scenario is where the server is overloaded by all these workers suddenly interacting with the server (uploading pgns, requesting the version, requesting new tasks), and suddenly whatever is still running on other tasks (updating tasks, requesting spsa) starts failing massively.

If a test passes, and the user stops other tests within a few seconds, there can be even larger 'relocations'.

Meanwhile, adding 75k cores to the server, these are the average timings:

request_spsa 3.33304
request_task 222.426
request_version 0.198212
update_task 1.97667
upload_pgn 22.01

server is over the edge now

The scenario is where the server is overloaded by all these workers suddenly interacting with the server (uploading pgns, requesting the version, requesting new tasks), and suddenly whatever is still running on other tasks (updating tasks, requesting spsa) starts failing massively.

They are not really failing but waitress blocks the requests forever (since it allocates only 4 worker threads to fishtest).

If waitress would pass them on (by allocating more threads) then Fishtest could at least reject request_task quickly if its queue is too long, using the semaphore code which is already there.

If a test passes, and the user stops other tests within a few seconds, there can be even larger 'relocations'.

PROD waitress is configured with:

# Match connection limit with max number of workers
connection_limit = 10
threads = 10
May 14 17:11:32 tests.stockfishchess.org pserve[15504]: request_task too busy
May 14 17:11:37 tests.stockfishchess.org pserve[15504]: request_task too busy
May 14 17:11:37 tests.stockfishchess.org pserve[15504]: request_task too busy
May 14 17:11:59 tests.stockfishchess.org pserve[15504]: request_task too busy
May 14 17:11:59 tests.stockfishchess.org pserve[15504]: request_task too busy
May 14 17:12:00 tests.stockfishchess.org pserve[15504]: request_task too busy

Timing with fleet at 75k cores (65k joined) with the new waitress configuration (10 threads, 10 connections, no nginx filtering)

fishtest@tests:~$ sudo bash x/ana_access_log.sh
# logging from [14/May/2024:17:34:56 +0000] to [14/May/2024:17:35:58 +0000] ########
# duration (seconds)            : 62
# calls in total                : 10000
# calls per second              : 161
# calls not reaching the backend:
# calls handled by the backend  : 10000
#                         route    calls      total    average    minimum    maximum
               /api/update_task     4120  19914.853      4.834      0.001     30.023
                      /api/beat     4244  16596.756      3.911      0.001     30.029
           /api/request_version      775   1921.201      2.479      0.000     30.000
              /api/request_task      499   1324.154      2.654      0.008     30.004
              /api/request_spsa      176    370.445      2.105      0.002     29.985
               /api/failed_task       41    100.976      2.463      0.003     29.983
                   /tests/view/        9     61.705      6.856      0.000     31.005
                  /api/get_elo/        6     35.309      5.885      0.039     30.005
                         /tests       26     18.989      0.730      0.015      2.738
                   /tests/user/        5      5.303      1.061      0.865      1.225
                /api/upload_pgn       50      3.047      0.061      0.016      0.350
                  /tests/tasks/        1      0.715      0.715      0.715      0.715
                   /api/actions       33      0.574      0.017      0.003      0.036
                /tests/finished        2      0.191      0.096      0.037      0.154
                 /tests/approve        2      0.151      0.075      0.051      0.100
               /api/active_runs        2      0.118      0.059      0.053      0.065
                       /api/nn/        9      0.054      0.006      0.004      0.008

Trying to reduce the priority of my tune I get:

504 Gateway Time-out

Edit: Wrong! I was modifying another job, not the tune. I did manage to reduce the tp on the tune.

Timing with fleet at 75k cores (65k joined) with the new waitress configuration (10 threads, 10 connections, no nginx filtering)

fishtest@tests:~$ sudo bash x/ana_access_log.sh
# logging from [14/May/2024:17:34:56 +0000] to [14/May/2024:17:35:58 +0000] ########
# duration (seconds)            : 62
# calls in total                : 10000
# calls per second              : 161
# calls not reaching the backend:
# calls handled by the backend  : 10000
#                         route    calls      total    average    minimum    maximum
               /api/update_task     4120  19914.853      4.834      0.001     30.023
                      /api/beat     4244  16596.756      3.911      0.001     30.029
           /api/request_version      775   1921.201      2.479      0.000     30.000
              /api/request_task      499   1324.154      2.654      0.008     30.004
              /api/request_spsa      176    370.445      2.105      0.002     29.985
               /api/failed_task       41    100.976      2.463      0.003     29.983
                   /tests/view/        9     61.705      6.856      0.000     31.005
                  /api/get_elo/        6     35.309      5.885      0.039     30.005
                         /tests       26     18.989      0.730      0.015      2.738
                   /tests/user/        5      5.303      1.061      0.865      1.225
                /api/upload_pgn       50      3.047      0.061      0.016      0.350
                  /tests/tasks/        1      0.715      0.715      0.715      0.715
                   /api/actions       33      0.574      0.017      0.003      0.036
                /tests/finished        2      0.191      0.096      0.037      0.154
                 /tests/approve        2      0.151      0.075      0.051      0.100
               /api/active_runs        2      0.118      0.059      0.053      0.065
                       /api/nn/        9      0.054      0.006      0.004      0.008

Averages seem to be a bit better though not spectacularly. There are still 30s timeouts. I guess these are due to Fishtest now and not to waitress since waitress is no longer queuing.

Timing with fleet at 50k cores (all cores joined) with the new waitress configuration (10 threads, 10 connections, no nginx filtering)

fishtest@tests:~$ sudo bash x/ana_access_log.sh
# logging from [14/May/2024:18:19:06 +0000] to [14/May/2024:18:20:39 +0000] ########
# duration (seconds)            : 93
# calls in total                : 10000
# calls per second              : 107
# calls not reaching the backend:
# calls handled by the backend  : 10000
#                         route    calls      total    average    minimum    maximum
                      /api/beat     5491  10755.127      1.959      0.001     30.119
               /api/update_task     3702   3999.841      1.080      0.001     30.042
              /api/request_task      176    258.019      1.466      0.003     30.042
           /api/request_version      187    151.454      0.810      0.000     29.988
              /api/request_spsa      176    122.247      0.695      0.001     29.984
               /api/failed_task        6     30.135      5.022      0.004     29.980
                   /tests/user/       17     17.955      1.056      0.708      2.232
                         /tests       22     17.872      0.812      0.011      2.072
                /api/upload_pgn      123      6.679      0.054      0.010      0.560
                     /tests/run        2      3.305      1.653      0.576      2.729
                  /tests/tasks/        2      2.044      1.022      0.924      1.120
                   /api/actions       48      1.455      0.030      0.003      0.279
                   /tests/view/       14      0.935      0.067      0.019      0.143
                  /api/get_elo/       14      0.536      0.038      0.017      0.075
                /tests/finished        2      0.366      0.183      0.169      0.197
                           /nns        3      0.360      0.120      0.015      0.327
               /tests/live_elo/       11      0.301      0.027      0.002      0.051
               /api/active_runs        2      0.118      0.059      0.053      0.065
                  /tests/modify        1      0.082      0.082      0.082      0.082
                 /tests/approve        1      0.055      0.055      0.055      0.055
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
15151 mongodb   20     4987336 1.873g  14036 S  15.6 37.5   9:12.42  `- /usr/bin/mongod --config /etc/mongod.conf
15504 fishtest  20     2034888 511892   5740 R  44.0  9.8  50:57.23  `- /home/fishtest/fishtest/server/env/bin/python3 /home/fishtest/fishtest/server/env/bin/pserve production.ini http_port=6543
15505 fishtest  20     1792252 563880   4856 S  37.7 10.8  12:40.41  `- /home/fishtest/fishtest/server/env/bin/python3 /home/fishtest/fishtest/server/env/bin/pserve production.ini http_port=6544
15506 fishtest  20     1655992 510000   6460 S   5.3  9.7  14:16.69  `- /home/fishtest/fishtest/server/env/bin/python3 /home/fishtest/fishtest/server/env/bin/pserve production.ini http_port=6545
20177 root      20      259360 159112        S        3.0            `- nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
20178 www-data  20      327500 231204   3852 R  35.8  4.4  19:44.38      `- nginx: worker process
20179 www-data  20      259508 160160    832 S        3.1   0:00.02      `- nginx: cache manager process

This seems to be much worse than the profile posted by vdv with 126 calls/s. So maybe my theory is wrong...

This seems to be much worse than the profile posted by vdv with 126 calls/s. So maybe my theory is wrong...

We are testing some configurations to understand why the max time is still at 30 seconds with such low CPU usage.

This seems to be much worse than the profile posted by vdv with 126 calls/s. So maybe my theory is wrong...

threads/connection_limit:
10/10 - bad API timing and errors in the nginx log
4/10 - bad API timing and errors in the nginx log
10/50 - bad API timing and errors in the nginx log
10/100 - good API timing and no errors in the nginx log

example of nginx errors:

2024/05/14 19:06:26 [error] 5194#5194: *723978 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/update_task HTTP/1.1", upstream: "http://127.0.0.1:6543/api/update_task", host: "tests.stockfishchess.org"
2024/05/14 19:06:26 [error] 5194#5194: *723933 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/update_task HTTP/1.1", upstream: "http://127.0.0.1:6543/api/update_task", host: "tests.stockfishchess.org"
2024/05/14 19:06:26 [error] 5194#5194: *723960 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/beat HTTP/1.1", upstream: "http://127.0.0.1:6543/api/beat", host: "tests.stockfishchess.org"
2024/05/14 19:06:26 [error] 5194#5194: *723988 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/update_task HTTP/1.1", upstream: "http://127.0.0.1:6543/api/update_task", host: "tests.stockfishchess.org"
2024/05/14 19:06:26 [error] 5194#5194: *724164 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/update_task HTTP/1.1", upstream: "http://127.0.0.1:6543/api/update_task", host: "tests.stockfishchess.org"
2024/05/14 19:06:26 [error] 5194#5194: *724167 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/beat HTTP/1.1", upstream: "http://127.0.0.1:6543/api/beat", host: "tests.stockfishchess.org"
2024/05/14 19:06:30 [error] 5194#5194: *725452 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/update_task HTTP/1.1", upstream: "http://127.0.0.1:6543/api/update_task", host: "tests.stockfishchess.org"
2024/05/14 19:06:30 [error] 5194#5194: *725629 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/beat HTTP/1.1", upstream: "http://127.0.0.1:6543/api/beat", host: "tests.stockfishchess.org"
2024/05/14 19:06:31 [error] 5194#5194: *725904 upstream timed out (110: Connection timed out) while reading response header from upstream, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/beat HTTP/1.1", upstream: "http://127.0.0.1:6543/api/beat", host: "tests.stockfishchess.org"

vdv showed "high" earlier with this data:

#                         route    calls      total    average    minimum    maximum
                     /api/beat     4854  12774.715      2.632      0.002     24.684
              /api/update_task     2673   4104.362      1.535      0.002     24.689
          /api/request_version     1711   1581.257      0.924      0.001     23.757
             /api/request_task      583   1036.156      1.777      0.042     24.179
                 /api/get_elo/        7     29.748      4.250      0.108     24.629

This new "configuration (10 threads, 10 connections, no nginx filtering)" gives:

#                         route    calls      total    average    minimum    maximum
                      /api/beat     5491  10755.127      1.959      0.001     30.119
               /api/update_task     3702   3999.841      1.080      0.001     30.042
              /api/request_task      176    258.019      1.466      0.003     30.042
           /api/request_version      187    151.454      0.810      0.000     29.988
              /api/request_spsa      176    122.247      0.695      0.001     29.984

The averages look better here than for the "high" stats.

The load is going to come in bursts so I think there will always be slow points under heavy load.

It seems that connection_limit must be at least 4 times the number of threads, otherwise waitress times out before answering nginx, even with low CPU usage.

Timing with fleet at 50k cores (all cores joined) with waitress configuration 6 threads, 100 connections, no nginx filtering

fishtest@tests:~$ sudo bash x/ana_access_log.sh
# logging from [14/May/2024:19:30:04 +0000] to [14/May/2024:19:31:29 +0000] ########
# duration (seconds)            : 85
# calls in total                : 10000
# calls per second              : 117
# calls not reaching the backend:
# calls handled by the backend  : 10000
#                         route    calls      total    average    minimum    maximum
                      /api/beat     6296    117.605      0.019      0.001      0.548
               /api/update_task     3108     47.373      0.015      0.001      0.558
                   /tests/user/       17     17.762      1.045      0.792      1.881
                         /tests       20     15.735      0.787      0.010      1.778
              /api/request_task      111     10.784      0.097      0.014      0.537
                /api/upload_pgn      106     10.036      0.095      0.017      0.848
              /api/request_spsa      149      1.991      0.013      0.001      0.120
                       /actions        2      1.981      0.991      0.935      1.046
                /tests/machines        1      1.733      1.733      1.733      1.733
                   /api/actions       34      1.565      0.046      0.004      0.325
                  /tests/tasks/        9      1.418      0.158      0.057      0.646
           /api/request_version      112      1.013      0.009      0.001      0.199
                   /tests/view/       16      0.939      0.059      0.009      0.245
                /tests/finished        5      0.310      0.062      0.028      0.141
                  /api/get_elo/        5      0.307      0.061      0.017      0.127
               /tests/live_elo/        5      0.124      0.025      0.015      0.034
               /api/active_runs        2      0.120      0.060      0.058      0.062
                  /tests/modify        1      0.063      0.063      0.063      0.063
                 /tests/approve        1      0.036      0.036      0.036      0.036

No failed_task in that list! 😃

Timing with fleet at 50k cores (all cores joined) with waitress configuration 6 threads, 100 connections, no nginx filtering

fishtest@tests:~$ sudo bash x/ana_access_log.sh
# logging from [14/May/2024:19:30:04 +0000] to [14/May/2024:19:31:29 +0000] ########
# duration (seconds)            : 85
# calls in total                : 10000
# calls per second              : 117
# calls not reaching the backend:
# calls handled by the backend  : 10000
#                         route    calls      total    average    minimum    maximum
                      /api/beat     6296    117.605      0.019      0.001      0.548
               /api/update_task     3108     47.373      0.015      0.001      0.558
                   /tests/user/       17     17.762      1.045      0.792      1.881
                         /tests       20     15.735      0.787      0.010      1.778
              /api/request_task      111     10.784      0.097      0.014      0.537
                /api/upload_pgn      106     10.036      0.095      0.017      0.848
              /api/request_spsa      149      1.991      0.013      0.001      0.120
                       /actions        2      1.981      0.991      0.935      1.046
                /tests/machines        1      1.733      1.733      1.733      1.733
                   /api/actions       34      1.565      0.046      0.004      0.325
                  /tests/tasks/        9      1.418      0.158      0.057      0.646
           /api/request_version      112      1.013      0.009      0.001      0.199
                   /tests/view/       16      0.939      0.059      0.009      0.245
                /tests/finished        5      0.310      0.062      0.028      0.141
                  /api/get_elo/        5      0.307      0.061      0.017      0.127
               /tests/live_elo/        5      0.124      0.025      0.015      0.034
               /api/active_runs        2      0.120      0.060      0.058      0.062
                  /tests/modify        1      0.063      0.063      0.063      0.063
                 /tests/approve        1      0.036      0.036      0.036      0.036

That seems very good...

Timing with fleet at 50k cores (all cores joined) with waitress configuration 6 threads, 100 connections, no nginx filtering.
Extended time stats.

fishtest@tests:~$ sudo bash x/ana_access_log.sh 150000
# logging from [14/May/2024:19:31:13 +0000] to [14/May/2024:19:54:25 +0000] ########
# duration (seconds)            : 1392
# calls in total                : 150000
# calls per second              : 107
# calls not reaching the backend: 17
# calls handled by the backend  : 149983
#                         route    calls      total    average    minimum    maximum
                      /api/beat    84356   1832.117      0.022      0.001      0.785
               /api/update_task    50862    806.108      0.016      0.001      0.801
              /api/request_task     3519    363.280      0.103      0.005      0.678
                         /tests      249    246.445      0.990      0.010      2.283
                /api/upload_pgn     3321    213.469      0.064      0.006      1.848
                   /tests/user/      161    176.090      1.094      0.681      2.544
           /api/request_version     3522     42.400      0.012      0.000      0.476
              /api/request_spsa     2396     37.643      0.016      0.001      0.441
                     /tests/run       19     32.080      1.688      0.511      2.977
                        /upload        4     28.801      7.200      0.007     23.808
                   /api/actions      593     16.545      0.028      0.003      0.238
                  /api/get_elo/      231     11.521      0.050      0.010      0.236
                   /tests/view/      136      8.885      0.065      0.004      0.399
               /tests/live_elo/      219      7.344      0.034      0.001      0.429
                       /api/nn/      256      5.083      0.020      0.002      0.265
                /tests/finished       31      4.194      0.135      0.036      0.230
                       /actions        7      3.418      0.488      0.046      1.737
               /api/active_runs       48      3.170      0.066      0.034      0.208
                  /tests/tasks/       20      3.042      0.152      0.002      0.681
                /tests/machines        1      1.816      1.816      1.816      1.816
                  /contributors        3      0.590      0.197      0.101      0.259
                  /tests/stats/        2      0.587      0.294      0.056      0.531
                 /tests/approve        8      0.377      0.047      0.025      0.087
                           /nns        8      0.187      0.023      0.011      0.040
                    /tests/stop        1      0.147      0.147      0.147      0.147
                          /user        2      0.049      0.025      0.024      0.025
                              /        2      0.038      0.019      0.005      0.033
                        /signup        1      0.021      0.021      0.021      0.021
                        /logout        1      0.020      0.020      0.020      0.020
                  /tests/modify        1      0.018      0.018      0.018      0.018
                      /api/pgn/        2      0.007      0.004      0.003      0.004
               /user_management        1      0.002      0.002      0.002      0.00

Do we still get request_task too busy? I.e. Fishtest blowing off some steam?

May 14 19:43:50 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:43:50 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:43:50 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:43:50 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:43:50 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:43:50 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:47:51 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:47:51 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:47:51 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:47:51 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:47:51 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:47:51 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:47:51 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:48:25 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:51:57 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:51:57 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:51:57 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:51:57 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:51:57 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:55:52 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:55:52 tests.stockfishchess.org pserve[6619]: request_task too busy
May 14 19:55:52 tests.stockfishchess.org pserve[6619]: request_task too busy

My theory is that because 6 > 5, Fishtest can now handle request_task connections more quickly, possibly by rejecting some of them. This will declog the waitress queue...

EDIT: OTOH the request_task too busy events are not happening all that often. So perhaps they are a red herring.

# Limit concurrent request_task
task_lock = threading.Lock()             # serialises the actual task assignment
task_semaphore = threading.Semaphore(4)  # at most 4 requests admitted at once
task_time = 0
task_runs = None
worker_runs = {}

If I'm not wrong, 6 > 4: the semaphore admits 4 concurrent requests, which are then served sequentially because of the lock. The fifth request gets the "too busy" answer.
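
For illustration, the rejection path presumably looks something like the sketch below (inferred from the snippet above and the log message; it is not the exact fishtest code, and the rejection response shown is only illustrative):

import threading
import time

task_lock = threading.Lock()             # serialises the actual task assignment
task_semaphore = threading.Semaphore(4)  # at most 4 requests admitted at once

def sync_request_task(worker_info):
    # Stand-in for the expensive task-assignment work.
    time.sleep(0.2)
    return {"task_waiting": True}

def request_task(worker_info):
    # Non-blocking acquire: if four requests are already inside, reject at once
    # instead of letting a fifth waitress thread queue up behind them.
    if task_semaphore.acquire(blocking=False):
        try:
            with task_lock:  # only one assignment runs at a time
                return sync_request_task(worker_info)
        finally:
            task_semaphore.release()
    else:
        print("request_task too busy")
        return {"task_waiting": False}  # illustrative rejection response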

EDIT-000: confirmed, with threads=5 the "too busy" is still there:

May 14 20:31:56 tests.stockfishchess.org pserve[12610]: request_task too busy
May 14 20:31:56 tests.stockfishchess.org pserve[12610]: request_task too busy
May 14 20:31:56 tests.stockfishchess.org pserve[12610]: request_task too busy
May 14 20:31:56 tests.stockfishchess.org pserve[12610]: request_task too busy

Fleet 75k cores (all cores joined), the server worked fine with waitress configuration at 5 threads, 100 connections. nginx logs clean, events log mostly clean, API timing good, primary instance CPU 55-85% with some short spikes at 95% (it seems just after a run finished).

And in this situation the following server timings:

failed_task 3.58
request_spsa 3.19196
request_task 215.239
request_version 0.241244
update_task 5.50691
upload_pgn 74.2251

and

# logging from [14/May/2024:21:54:06 +0000] to [14/May/2024:21:55:09 +0000] ########
# duration (seconds)            : 63
# calls in total                : 10000
# calls per second              : 158
# calls not reaching the backend: 1
# calls handled by the backend  : 9999
#                         route    calls      total    average    minimum    maximum
                      /api/beat     6330    232.114      0.037      0.001      0.551
               /api/update_task     3018    107.901      0.036      0.001      0.903
              /api/request_task      125     29.310      0.234      0.069      0.840
                         /tests        9      8.620      0.958      0.013      1.962
                   /tests/user/        5      6.618      1.324      0.911      1.789
              /api/request_spsa      184      5.478      0.030      0.001      0.461
                       /actions        5      5.456      1.091      0.885      1.255
                /api/upload_pgn       93      5.306      0.057      0.012      0.436
           /api/request_version      127      4.441      0.035      0.001      0.210
                  /tests/tasks/       10      2.481      0.248      0.013      0.776
                   /tests/view/       14      1.395      0.100      0.020      0.359
                  /api/get_elo/        8      0.658      0.082      0.012      0.183
                   /api/actions       37      0.621      0.017      0.002      0.042
               /tests/live_elo/        8      0.608      0.076      0.005      0.291
                       /api/nn/       22      0.254      0.012      0.003      0.065
                /tests/finished        1      0.137      0.137      0.137      0.137
                  /contributors        1      0.098      0.098      0.098      0.098
               /api/active_runs        1      0.086      0.086      0.086      0.086
                        /signup        1      0.002      0.002      0.002      0.002

Those look great. (the 10000 calls)

But I must admit, I don't understand what those "server timings" numbers are that you quote. Are they totals of elapsed time?

failed_task 3.58
request_spsa 3.19196
request_task 215.239
request_version 0.241244
update_task 5.50691
upload_pgn 74.2251

That request_task number is really big!

I think these are the server times as reported by the worker. They are in ms.

right, just averages computed from what is reported in the worker logs, as described in the first comment #2003 (comment) of this issue.

So the current evidence seems to suggest that the culprit was the fact that waitress was configured with no more threads than the throttling semaphore used by Fishtest (4 <= 4), so the latter was effectively a no-op.

IMO not quite; we have had similar numbers for 75k cores before these changes. Note that the number of workers alone is only part of the story: if most tests are (V)LTC, the load on the server is lower than if the tests are STC/SPSA.
This just shows that this config works, and which changes didn't work (see pasquale's list).
Even with this config the server is very much at the limit, which becomes visible when these workers join.

The pattern is quite clear: either the server can (just) keep up with the load and the API times look good, or it can't keep up and the numbers become very bad quickly. It is basically what you would expect if, for example, the server gets a sustained 4 tasks of 0.2 s each per second versus 6 such tasks per second: with a capacity of 5 tasks per second, in the former case responses are immediate, while in the latter case the waiting time grows without bound.
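
To make that concrete, here is a minimal single-server FIFO queue sketch (illustrative Python, not fishtest code), using 0.2 s per request, i.e. a capacity of 5 requests per second, fed at 4/s and at 6/s:

def waiting_times(arrival_rate, service_time=0.2, seconds=60):
    free_at = 0.0  # time at which the single server becomes free again
    waits = []
    for i in range(int(arrival_rate * seconds)):
        arrival = i / arrival_rate     # evenly spaced arrivals
        start = max(arrival, free_at)  # wait if the server is still busy
        waits.append(start - arrival)
        free_at = start + service_time
    return waits

for rate in (4, 6):
    w = waiting_times(rate)
    print(f"{rate}/s: final wait {w[-1]:.2f}s, max wait {max(w):.2f}s")

# At 4/s every request is served on arrival; at 6/s the backlog grows by
# ~33 ms per request, reaching roughly 12 s of waiting after one minute.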

for example, the server gets a sustained 4 tasks of 0.2 s each per second versus 6 such tasks per second: in the former case responses are immediate, while in the latter case the waiting time grows without bound.

This is not true if Fishtest (I mean the app) quickly rejects the request_task calls it cannot handle. Then this declogs the waitress queue and we no longer see the dreaded 30 s timeouts.

Here is the original commit introducing the semaphore.

2c9e381

Certainly, it limits concurrent access to that API, but it does not seem to declog the waitress queue, at least not sufficiently: even with this config we still see the 30 s API timeouts.

While I don't know the exact cause of the API timeouts, the semaphore still allows 100% of the server capacity to be consumed by request_task: it doesn't limit the number of calls per second, it only limits concurrent access to that API. Queues can still grow with other API calls (such as heartbeat or update_task), and these can still time out?

But there weren’t any 30s timeouts in the latest log you posted #2003 (comment) (158 calls/s).

Compare this to the log of a similar load (153 calls/s) you posted in the beginning #2003 (comment).

Any of these logs can happen. With the current config the calls/s figure doesn't mean much by itself: it depends on what is being called (cheap calls like get_version / heartbeat vs expensive calls like request_task), and the server's maximum times can change quite drastically within seconds (e.g. right after a large test finishes).

Here is the original commit introducing the semaphore.

2c9e381

This was the waitress configuration at the time of the semaphore commit, set by commits c1c086d and a932f79:

# Match connection limit with max number of workers
connection_limit = 500
threads = 16

With the probability distribution of the API calls and their timings, we should be able to configure the proper queue lengths (for example, why a semaphore of 4?).

(The comment suggests that at the time fishtest was able to withstand max 500 workers :) )
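
Picking up the sizing question from above: a rough way to reason about it is to turn an access-log summary into busy server-seconds per wall-clock second for each route (calls per second times average duration). A minimal sketch with the numbers from the 14 May log above; note the averages are elapsed times, which include any queueing, so these are upper bounds on actual thread occupancy:

duration = 63  # seconds covered by the log excerpt

routes = {
    # route: (calls, average seconds per call), taken from the log above
    "/api/beat": (6330, 0.037),
    "/api/update_task": (3018, 0.036),
    "/api/request_task": (125, 0.234),
    "/api/upload_pgn": (93, 0.057),
}

for route, (calls, avg) in routes.items():
    load = calls / duration * avg
    print(f"{route:18s} ~{load:.2f} busy server-seconds per second")

# request_task alone generates ~0.46 s of work per second in this snapshot,
# so a single serialised lane (the lock) still has headroom; the semaphore
# only caps how many threads may pile up waiting for that lane.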

Could we reduce the waitress time limit now from 30s to something lower? 5s? (Maybe 10-15s first)

If there are hundreds (or more) of workers asking for request_task within a second or two, presumably a queue of ~20 forms on each thread. Does the timeout apply to the ones already in the queues, with waitress rejecting further requests?
I think we want the waitress timeout to be hit at really busy times.

I imagine the Fishtest "request_task too busy" response is pretty quick, but all 5 threads will sometimes be (trying to) execute request_task, so the arriving requests could fill the waitress connections pretty fast.

EDIT: I should have checked first: the semaphore means only 4 threads can be running request_task fully; with 5 threads, one will be rejecting requests if they all get them. Hopefully this can handle hundreds of rejections per second.

I guess we could increase that semaphore (and the thread count to match) to try to work the server harder. It might be most efficient at less than full CPU load, but that could be a bit higher than 4. Try Semaphore(8) with 10 threads? Or 16 and 18?

... primary instance CPU 55-85% with some short spikes at 95% (it seems just after a run finished).

Ah, I missed this bit. Maybe Semaphore(4) is good then if we're hitting 95% at times.

I would increase threads slightly (to 8?) to allow other things to happen when lots of request_tasks are arriving. Imagine if an upload happens at the same time.
Connections could probably be increased too, but no idea to what: 200? 500? It would be useful to know how quickly Fishtest can reject those extra request_task calls.
And the waitress timeout could come down so that it helps reduce the load when Fishtest is struggling and improves the user experience, say 5 s, but maybe as low as 1 s or 2 s?
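
To make these knobs concrete, here is a hedged sketch of the equivalent programmatic waitress setup (fishtest actually sets these in its ini file, as in the excerpt above; the values here are purely illustrative, and whether the observed 30 s limit corresponds to waitress's channel_timeout or to an nginx proxy timeout is an assumption):

from pyramid.paster import get_app
from waitress import serve

# Load the Pyramid app from its deployment config (path illustrative).
app = get_app("production.ini")

serve(
    app,
    listen="127.0.0.1:6543",  # illustrative bind address
    threads=8,                # a few more than the request_task semaphore (4),
                              # so other routes always have a free thread
    connection_limit=200,     # stop accepting new connections beyond this many
    channel_timeout=15,       # close connections inactive for this many seconds;
                              # possibly, but not certainly, the 30 s knob
)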

@xoto10

I would increase threads slightly (to 8?) to allow other things to happen when lots of request_tasks are arriving. Imagine if an upload happens at the same time.

At most one thread can enter request_task (I am not sure if this is good or bad). So there are plenty of other threads available.

The bad thing happens if all threads are queuing to enter request_task, but this is prevented by the semaphore (with Semaphore(4) and 5 threads, at most 4 of them can be inside request_task at once).