kblomdahl/dream-go

Monte-Carlo tree search as regularized policy optimization

Opened this issue · 3 comments

The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to significant advances in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm, still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero's search heuristics, along with other common ones such as UCT, are an approximation to the solution of a specific regularized policy optimization problem. With this insight, we propose a variant of AlphaZero which uses the exact solution to this policy optimization problem, and show experimentally that it reliably outperforms the original algorithm in multiple domains.

http://proceedings.mlr.press/v119/grill20a/grill20a.pdf
http://proceedings.mlr.press/v119/grill20a/grill20a-supp.pdf
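For reference, the paper's exact policy (its equation (3)) has the form pi_bar(a) = lambda * prior(a) / (alpha - q(a)), where alpha is found by a dichotomy (binary) search so that pi_bar sums to 1, using the bounds on alpha derived in the paper's appendix. A minimal illustrative sketch in Rust follows; the names are made up for this sketch and are not the dream-go implementation:

```rust
/// Exact solution to the regularized policy optimization problem of
/// Grill et al. (2020), eq. (3): pi_bar(a) = lambda * prior[a] / (alpha - q[a]),
/// with alpha chosen by dichotomy search so that pi_bar sums to one.
/// Illustrative sketch only; names do not come from the dream-go source.
fn exact_policy(prior: &[f64], q: &[f64], lambda: f64) -> Vec<f64> {
    // Bounds on alpha from the paper's appendix: the true alpha lies in
    // [max_a(q[a] + lambda * prior[a]), max_a q[a] + lambda].
    let alpha_min = prior.iter().zip(q)
        .map(|(p, q)| q + lambda * p)
        .fold(f64::NEG_INFINITY, f64::max);
    let alpha_max = q.iter().fold(f64::NEG_INFINITY, |m, &x| m.max(x)) + lambda;

    let (mut lo, mut hi) = (alpha_min, alpha_max);
    for _ in 0..64 {
        // The sum is strictly decreasing in alpha, so bisect until it hits 1.
        let alpha = 0.5 * (lo + hi);
        let total: f64 = prior.iter().zip(q)
            .map(|(p, q)| lambda * p / (alpha - q))
            .sum();
        if total > 1.0 { lo = alpha } else { hi = alpha }
    }

    let alpha = 0.5 * (lo + hi);
    prior.iter().zip(q).map(|(p, q)| lambda * p / (alpha - q)).collect()
}
```

The dichotomy over alpha is what gives the ~log cost mentioned below: each probe evaluates the sum once over the legal actions, and the number of probes is fixed by the desired precision, not by the number of actions.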

The initial implementation seems to work, but without a scheme such as virtual visits it does not allow enough parallelism for acceptable performance.

Hi Chicoryn,
I am also interested in this paper. I wonder what "virtual visits" means here? I guess the original dichotomy search scheme scales well with the number of legal actions (complexity of ~log(n)), and adding it to the action selection stage won't cost too much.

@peppacat Hello, this was some time ago so I must admit I've forgotten a lot of the implementation details of my experiments. But "virtual visits" normally refers to a penalty [1] added to leaf nodes that are currently being expanded, to prevent the next probe into the search tree from trying to expand the same node again. This is very important in real play scenarios where we are not restricted by the number of probes into the search tree, but rather by a real-time limit, so it is better to fill each batch completely instead of running a half-full batch through the value function.
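A minimal sketch of such a virtual-visit penalty (the struct and method names here are illustrative, not the actual dream-go types from [1]):

```rust
use std::sync::atomic::{AtomicI32, Ordering};

/// Illustrative edge in the search tree. While a leaf below this edge is
/// queued for batched evaluation, the edge carries a temporary "virtual"
/// visit so that concurrent probes see a lower selection score and spread
/// out over the tree instead of all descending into the same leaf.
struct Edge {
    visits: AtomicI32,
    vvisits: AtomicI32, // pending (virtual) visits through this edge
}

impl Edge {
    /// Effective visit count used during selection: real plus virtual.
    fn effective_visits(&self) -> i32 {
        self.visits.load(Ordering::Relaxed) + self.vvisits.load(Ordering::Relaxed)
    }

    /// Called on the way down, before the network evaluation is queued.
    fn add_virtual_visit(&self) {
        self.vvisits.fetch_add(1, Ordering::AcqRel);
    }

    /// Called during backup: the virtual visit becomes a real one.
    /// (A real implementation would also accumulate `value` atomically.)
    fn settle(&self, value: f64) {
        self.vvisits.fetch_sub(1, Ordering::AcqRel);
        self.visits.fetch_add(1, Ordering::AcqRel);
        let _ = value;
    }
}
```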

As for your real question, I am not sure about the answer, since, as you indicate, it seems like it should be fairly straightforward to introduce virtual losses into equation (3). I suspect I did not have sufficient time and energy to investigate this further.
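One speculative way to fold virtual visits into the inputs of equation (3) (this is not from the paper or dream-go, just a sketch of the standard virtual-loss trick applied to the q-values fed to the exact solver): count each pending evaluation through an edge as a playout with the worst possible value, so the edge looks temporarily unattractive.

```rust
/// Speculative sketch: q-value adjusted by virtual losses before being fed
/// to the exact solver of eq. (3). Each pending evaluation (`vvisits`) is
/// counted as a lost playout with value -1. Names are illustrative.
fn q_with_virtual_loss(total_value: f64, visits: u32, vvisits: u32) -> f64 {
    (total_value - vvisits as f64) / (visits + vvisits).max(1) as f64
}
```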

[1] https://github.com/Chicoryn/dream-go/blob/master/src/libdg_mcts/tree.rs#L1368