High-Temperature Shallow Game Branching

Replicate the following idea from Appendix D of the KataGo paper:

In 2.5% of positions, the game is branched to try an alternative move drawn randomly from
the policy of the net 70% of the time with temperature 1, 25% of the time with temperature
2, and otherwise with temperature infinity. A full search is performed to produce a policy
training sample (the MCTS search winrate is used for the game outcome target and the score
and ownership targets are left unconstrained). This ensures that there is a small percentage
of training data on how to respond to or refute moves that a full search might not play.
Recursively, a random quarter of these branches are continued for an additional move.
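
For concreteness, here is a minimal Python sketch of the branching logic described above. The `position` API and the `net_policy` / `full_search` / `record_sample` callbacks are hypothetical placeholders, not anything in this repo or in KataGo:

```python
import numpy as np

BRANCH_PROB = 0.025      # branch 2.5% of self-play positions
TEMP_CHOICES = (1.0, 2.0, float('inf'))
TEMP_PROBS = (0.70, 0.25, 0.05)
CONTINUE_PROB = 0.25     # a random quarter of branches continue one more move


def sample_branch_move(policy: np.ndarray) -> int:
    """Draw an alternative move from the raw net policy at a random temperature."""
    temp = np.random.choice(TEMP_CHOICES, p=TEMP_PROBS)
    if np.isinf(temp):
        # Temperature infinity: uniform over legal moves (nonzero policy entries).
        probs = (policy > 0).astype(float)
        probs /= probs.sum()
    else:
        probs = policy ** (1.0 / temp)
        probs /= probs.sum()
    return int(np.random.choice(len(policy), p=probs))


def maybe_branch(position, net_policy, full_search, record_sample):
    """Hypothetical per-position hook in the self-play loop."""
    if np.random.random() >= BRANCH_PROB:
        return
    branch = position
    while True:
        move = sample_branch_move(net_policy(branch))
        branch = branch.apply_move(move)   # assumed immutable-state API
        result = full_search(branch)       # full MCTS search on the branched line
        # The search policy is the policy target; the MCTS winrate stands in
        # for the game outcome target (score/ownership left unconstrained).
        record_sample(branch, result.policy, result.winrate)
        if np.random.random() >= CONTINUE_PROB:
            break
```

Note that the MCTS winrate has to substitute for the game outcome target here because the branched line is never played out to completion.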

Note: this caveat from the KataGo paper likely applies to this idea as well:

Except for introducing a minimum necessary amount of entropy, the above settings very likely have
only a limited effect on overall learning efficiency and strength. They were used primarily so that
KataGo would have experience with alternate rules, komi values, handicap openings, and positions
where both sides have played highly suboptimally in ways that would never normally occur in
high-level play, making it more effective as a tool for human amateur game analysis.