kblomdahl/dream-go

Reduce the number of OS threads used during search

Closed this issue · 2 comments

There are roughly three places where threads are spawned in Dream Go:

  • We extract features from multiple games in parallel in DataSet or self_play. Each game can spawn multiple search threads.
  • We evaluate several possible moves from the search tree in parallel.
  • We evaluate several batches of possible moves in parallel on the GPU.

There are several issues with this approach. The first is that we usually want to play a huge number of games at the same time to ensure good GPU saturation. This means running at least 512 search threads in total, spread over many games, since we can only reliably saturate about 4 search threads per game (due to bottlenecks during the opening and end-game). So running 512 search threads means 128 games, each of which also needs its own game thread, for 640 threads in total (not counting the batch threads, whose number is small anyway).

This is a large number of context switches per second, and causes significant synchronisation overhead.

The second issue is that there is currently only one knob to tune all of this: the NUM_THREADS environment variable, which gets scaled by different factors in different parts of the program to determine the number of threads to actually use.


To fix this while keeping the code somewhat clean, we would need to change the architecture to rely on one of the following two popular approaches:

  • Future-based event loop
  • Cooperative threading using fibers / coroutines

In either case we would spawn NUM_THREADS actual worker threads that drive the event loop of fibers using a work-stealing algorithm. This would allow us to spawn an effectively unlimited number of tasks in each of the components mentioned without having to worry about the overhead (and locks can be replaced with spinlocks, since we would no longer need to worry about a badly timed context switch).
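The core idea can be sketched with a fixed-size worker pool draining a shared queue of tasks. This is a minimal sketch using only the standard library: a real fiber or future scheduler would use per-worker work-stealing deques rather than one shared queue, and none of the names here are Dream Go's actual API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Run all `tasks` on exactly `num_threads` OS threads. Components can
/// enqueue as many logical tasks as they like without spawning more threads.
fn run_pool(num_threads: usize, tasks: Vec<Box<dyn FnOnce() + Send>>) {
    let queue = Arc::new(Mutex::new(tasks));
    let workers: Vec<_> = (0..num_threads)
        .map(|_| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || loop {
                // pop one task; the lock guard is dropped before the task runs
                let task = queue.lock().unwrap().pop();
                match task {
                    Some(task) => task(),
                    None => break,
                }
            })
        })
        .collect();

    for worker in workers {
        worker.join().unwrap();
    }
}

fn main() {
    // 512 logical "search tasks" executed on only 4 OS threads.
    let done = Arc::new(AtomicUsize::new(0));
    let tasks: Vec<Box<dyn FnOnce() + Send>> = (0..512)
        .map(|_| {
            let done = Arc::clone(&done);
            Box::new(move || {
                done.fetch_add(1, Ordering::SeqCst);
            }) as Box<dyn FnOnce() + Send>
        })
        .collect();

    run_pool(4, tasks);
    assert_eq!(done.load(Ordering::SeqCst), 512);
    println!("ran 512 tasks on 4 threads");
}
```

With this shape, "spawning a search thread" becomes pushing a cheap task onto the queue, so the OS thread count stays fixed at NUM_THREADS regardless of how many games are in flight.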


An alternative would be to introduce additional environment variables from which we read the number of threads to use in each component. This would give us full control without the complexity of the previous approach, but we would still suffer the overhead associated with OS threads.
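Reading such per-component knobs is straightforward. The variable names below are hypothetical (the issue only mentions NUM_THREADS); this just shows the parse-with-fallback pattern:

```rust
use std::env;

/// Read a numeric environment variable, falling back to `default` if it is
/// unset or unparseable.
fn env_or(name: &str, default: usize) -> usize {
    env::var(name)
        .ok()
        .and_then(|value| value.parse().ok())
        .unwrap_or(default)
}

fn main() {
    // Hypothetical per-component knobs; names are illustrative only.
    let num_games = env_or("NUM_GAMES", 128); // games played in parallel
    let num_search_threads = env_or("NUM_SEARCH_THREADS", 512); // total MCTS threads
    let num_batch_threads = env_or("NUM_BATCH_THREADS", 4); // parallel GPU batches

    println!(
        "games: {}, search threads: {}, batch threads: {}",
        num_games, num_search_threads, num_batch_threads
    );
}
```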

These variables could of course also be introduced in the previous approach, to reap the benefits of both.

I've introduced some new command-line arguments to help deal with this:

  • --num-games - determines the number of games to play in parallel.
  • --num-threads - determines the total number of MCTS threads.

We then determine the number of threads to use in each search tree as NUM_THREADS / NUM_GAMES. This makes it possible to control the number of threads, even if we still end up with a lot of them.
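The split above amounts to a one-line division. This sketch assumes the budget is clamped to at least one thread per game when NUM_GAMES exceeds NUM_THREADS; that guard is my assumption, not something stated in the issue:

```rust
/// Number of MCTS threads each search tree gets, given a global thread
/// budget split evenly over parallel games (as --num-threads / --num-games).
fn threads_per_game(num_threads: usize, num_games: usize) -> usize {
    // assumed guard: never hand a game zero search threads
    std::cmp::max(1, num_threads / num_games)
}

fn main() {
    // the numbers from the issue: 512 threads over 128 games = 4 per game
    assert_eq!(threads_per_game(512, 128), 4);
    // more games than threads still leaves one thread per game
    assert_eq!(threads_per_game(512, 1024), 1);
    println!("ok");
}
```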

After adding the extra command-line arguments there do not seem to be any practical issues anymore, since we bottleneck on neural network evaluations anyway, so the context-switch overhead is minimal.

I might re-open this at some point in the future if I want to implement cooperative threading, but it does not seem worth the effort at the moment.