lichess-org/fishnet

Work starvation on high core count instances

cyqsimon opened this issue · 3 comments

I just finished setting up fishnet on my two servers today and spotted some interesting behaviour. It seems like fishnet's work assignment algorithm is poorly optimised for high core count systems. The workers are very often starved for long periods of time between short bursts of activity.

Typical CPU utilisation graphs

These are screenshots from btop, with each tick representing 500ms.

On the 32-core instance

[Screenshot: typical CPU utilisation on the 32-core instance]

And there are moments where we see full utilisation as well:

[Screenshot: full utilisation on the 32-core instance]

The starvation problem exists and is observable, but doesn't look too bad. Yet.

On the 128-core instance

Over here it's a different story. You can clearly see the work starvation, and it stayed consistently this bad over the few hours I spent observing it.

[Screenshot: typical CPU utilisation on the 128-core instance]

Those screenshots were taken at about the same time, so I don't think this is a case of "no work available in the pool". It seems weird that the 128-core instance is still taking long breaks while the 32-core instance is under full load.

And I did try running multiple instances with fewer cores each instead of a single 128-core instance, but I very quickly ran into the API rate limit.

Additional information

CPU: Epyc 9684 (96C192T)
OS: RockyLinux 9.3
Kernel: 5.14.0-362.18.1.el9_3.0.1.x86_64
Container engine: podman 4.6.1
Fishnet: Latest on Docker Hub (2.9.2)
Stockfish build chosen: stockfish-x86-64-vnni256

If you require any further help with debugging and/or testing I would love to be of assistance.

Thanks for reporting. Can you please try running with -v to get more detailed logs?

Can reproduce the pattern, and there just wasn't work in the queue. But please reopen if -v tells a different story.

Yeah, on further observation that does seem to be the case. I checked the Lichess API (https://lichess.org/fishnet/status) and indeed the queued fields are always 0 or near 0 whenever work starvation is happening. Thanks for investigating.
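
For anyone who wants to repeat that check, here is a minimal Python sketch. It assumes the status endpoint returns JSON containing numeric `queued` fields somewhere in the response; the exact layout is not given in this thread, so the code walks the whole tree rather than hard-coding a path. The helper names (`fetch_status`, `queued_counts`, `queue_is_empty`) are made up for this example and are not part of fishnet.

```python
import json
import urllib.request

# Endpoint mentioned in the thread.
STATUS_URL = "https://lichess.org/fishnet/status"


def fetch_status(url=STATUS_URL, timeout=10):
    """Download and parse the status JSON (performs a network request)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def queued_counts(node, path=""):
    """Collect every numeric "queued" field, keyed by its dotted path.

    The response layout is an assumption, so this searches the whole
    structure instead of relying on specific key names above "queued".
    """
    found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key == "queued" and isinstance(value, (int, float)):
                found[child] = value
            else:
                found.update(queued_counts(value, child))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            found.update(queued_counts(value, f"{path}[{i}]"))
    return found


def queue_is_empty(status, threshold=0):
    """True if at least one "queued" field exists and all are <= threshold."""
    counts = queued_counts(status)
    return bool(counts) and all(v <= threshold for v in counts.values())
```

Calling `queue_is_empty(fetch_status())` at the moment the CPU graph goes idle should tell you whether the gap lines up with an empty queue. If the request is rejected, the server may expect a custom User-Agent header; that is a guess on my part, not something confirmed here.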