Remove all locks from the monte carlo tree search
Closed this issue · 1 comments
Today the monte carlo tree search (MCTS) spend at least 30% of its time in lock contention, mainly due to the root node, and time control. There are some prior art on lockless algorithms for MCTS [1] [2], but none of the can be directly applied to our implementation due to how we compress the tree #36.
A few thoughts on how we can accomplish different parts of this:
- RCU would allow us to get ride of the locks necessary to block on upgrade from small to big node structure, which should be a very rare event.
- If we do not want to implement an RCU when a read-write lock would suffice, but is less desirable.
- Changing the node structure from an object of arrays to an array of objects would allow us to update statistics using 128-bit atomic ops. If we are willing to accept a non-deterministic search then we do not need to do this, and just normal atomic ops are sufficient when calculating the UCT.
[1] S. Ali Mirsoleimani, Jaap van den Herik, Aske Plaat, and Jos Vermaseren, A Lock-free Algorithm for Parallel MCTS, http://liacs.leidenuniv.nl/~plaata1/papers/paper_ICAART18.pdf
[2] Markus Enzenberger, and Martin Muller, A Lock-free Multithreaded Monte-Carlo Tree Search
Algorithm, https://webdocs.cs.ualberta.ca/~mmueller/ps/enzenberger-mueller-acg12.pdf
An initial implementation of RCU (using the quiescent approach) for expanding small nodes to big nodes seems to yield a reduction of CPU usage from 1,300% to 800% (with no change in playout performance, since that is GPU bound anyway).
There are still three locks present:
- Lock for updating the average value, since it involves reading three different variables. It might be possible to replace the lock with a CAS on value for the update, but we need to read the (value, count) tuple atomically.
Lock for finding an empty spot, and inserting, a child into a small node. This can probably be done with a CAS on indices, which would make it equivalent to a seqlock.Lock for the RCU update queue. This can probably be fixed with an atomic queue, or maybe some alternative implementation that does not need to maintain a queue of updates.
Overall, only the third lock is bad, and should be pretty easy to fix since the current implementation is mostly a proof of concept.