pytorch/ELF

[Suggestion] clarify FPU in paper

alreadydone opened this issue · 4 comments

First-play urgency (FPU) is the Q value assigned to unvisited nodes during tree search. There are several choices for this parameter (a rough sketch follows the list):

  • the parent node's Q (the average over its whole subtree)
  • the parent node's net eval (the value-head output at that single position, used in Minigo's earlier runs)
  • one of the above minus a quantity that depends on the visited policy (Leela Zero uses the parent net eval minus fpu_reduction)
  • a constant equal to the loss value (used in AGZ and AZ, as Matthew Lai from DeepMind clarified)
  • a constant equal to the draw value (shown to be unreasonable by the Minigo v11 run)
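
For concreteness, here is a rough sketch of these options (hypothetical code, not taken from any of the engines mentioned; the exact form of the Leela Zero reduction and the [-1, 1] value convention are my assumptions):

```cpp
// Hypothetical sketch of the FPU options above; names are illustrative only.
#include <cmath>

enum class FpuMode { ParentQ, ParentNetEval, ParentNetEvalMinusReduction, Loss, Draw };

struct ParentStats {
  double running_mean_q;  // mean value over the parent's visited subtree
  double net_eval;        // value-head output at the parent position
  double visited_policy;  // total policy prior of already-visited children
};

// Q assigned to an unvisited child (values in [-1, 1], from the side to move).
double first_play_urgency(const ParentStats& p, FpuMode mode,
                          double fpu_reduction = 0.25) {
  switch (mode) {
    case FpuMode::ParentQ:        return p.running_mean_q;
    case FpuMode::ParentNetEval:  return p.net_eval;
    case FpuMode::ParentNetEvalMinusReduction:
      // Leela Zero style: subtract a term that grows with the visited policy mass.
      return p.net_eval - fpu_reduction * std::sqrt(p.visited_policy);
    case FpuMode::Loss:           return -1.0;  // init-to-loss (AGZ/AZ, Minigo v14+)
    case FpuMode::Draw:           return 0.0;   // init-to-draw
  }
  return 0.0;  // unreachable
}
```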

Since ELF data says "root_unexplored_q_zero":false,"unexplored_q_zero":false, I assume it's not the draw-value approach. Could you please clarify, hopefully also in a future version of the paper, which FPU setting you used? It is an important parameter: apparently Minigo [1] (late v14 and all subsequent runs) and Leela Chess Zero (late Test30 and all subsequent runs) found a large benefit from the FPU = loss setting, and FPU could affect ladder behavior as well, since FPU = loss probably helps with deep tactical lines. Thanks for your attention.

[1] https://github.com/tensorflow/minigo/blob/master/RESULTS.md#v14-bigtable-integration-with-init-to-loss-switch

Around model 280, we discovered on LCZero forums that AlphaGoZero had used init-to-loss for Q values. We switched to init-to-loss from init-to-parent and saw our network gain dramatically in strength. Soon, our top model thoroughly exceeded our previous best model.

That's a good question.

It uses none of the above. OpenGo initializes a child with the parent node's current running mean Q at the moment the child node is created. If there is no current running mean Q yet, it uses the net val (or the draw value if root_unexplored_q_zero (at the root layer) or unexplored_q_zero is true). See:

https://github.com/pytorch/ELF/blob/master/src_cpp/elf/ai/tree_search/tree_search_node.h#L298
https://github.com/pytorch/ELF/blob/master/src_cpp/elf/ai/tree_search/tree_search_node.h#L227

I don't know whether it is a good choice, but it looks like it works.
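
For reference, a rough sketch of the behavior described above (with made-up field names, not the actual ELF data structures) looks like this:

```cpp
// Rough sketch of the described initialization, not the actual ELF code.
// When a child node is created, its initial Q is the parent's current running
// mean Q; if the parent has no running mean yet, fall back to the net val,
// or to the draw value (0) when the relevant *_unexplored_q_zero flag is set.
struct NodeStats {
  double q_sum = 0.0;    // sum of backed-up values
  int visits = 0;        // number of completed rollouts through this node
  double net_val = 0.0;  // value-head output at this node
};

double initial_child_q(const NodeStats& parent, bool unexplored_q_zero) {
  if (parent.visits > 0) {
    return parent.q_sum / parent.visits;  // current running mean Q
  }
  return unexplored_q_zero ? 0.0 : parent.net_val;
}
```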

Thanks for the clarification! The current running mean Q of the parent is basically what I meant by "the parent node's Q (average over whole subtree)" in my original post. But then the question seems to be: when is the child node created? I can only think of two possibilities:

  • either it is created when the parent is expanded, i.e. right after the parent's policy and value data are ready, so that the parent's Q is just the net val,
  • or it is created when it is selected by the UCT algorithm.

However, in the second case, in order for the child to be selected, its Q must already have been used to calculate the action value Q+U, and the crucial question is which Q was used to make that selection. I guess what you mean is that the parent's running mean Q is used to make the selection; if so, this is exactly what I meant by "the parent node's Q" in my OP. In that case, I guess the parent's running mean Q would also be used in previous rollouts that reached the parent but didn't select this child, and this running mean could be different each time. (However, since you only need to compute Q+U for the unvisited child with the highest policy prior, each child's Q only needs to be computed once before it is expanded.) Let me know if there's something I don't get. (P.S. I spent some time reading the code before, but I'm not sure I completely understand the behavior.)

It is the second case (i.e., when the child node is selected and created by the UCT algorithm).

Exactly, the running mean is different for each child at the time it is created. The reason is that we want to use the most accurate estimate available so far for the newly created child.
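
As a small illustration of that point (hypothetical code, not ELF's), an unvisited child is scored during selection with the parent's running mean at that moment, so children created in different rollouts can start from different Q values:

```cpp
// Hypothetical illustration of case 2: during PUCT selection, an unvisited
// child is scored with the parent's *current* running mean Q, which keeps
// changing as backups update the parent, so children created in different
// rollouts start from different Q values.
#include <cmath>
#include <cstddef>
#include <vector>

struct Child {
  double prior = 0.0;  // policy prior P(s, a)
  double q_sum = 0.0;
  int visits = 0;
};

struct Parent {
  double q_sum = 0.0;
  int visits = 0;
  std::vector<Child> children;

  double running_mean_q() const {
    return visits > 0 ? q_sum / visits : 0.0;
  }

  // Pick argmax of Q + U; unvisited children fall back to the parent's
  // current running mean Q (the FPU discussed in this thread).
  std::size_t select(double c_puct) const {
    double best = -1e30;
    std::size_t best_idx = 0;
    for (std::size_t i = 0; i < children.size(); ++i) {
      const Child& ch = children[i];
      double q = ch.visits > 0 ? ch.q_sum / ch.visits : running_mean_q();
      double u = c_puct * ch.prior * std::sqrt(static_cast<double>(visits)) /
                 (1.0 + ch.visits);
      if (q + u > best) {
        best = q + u;
        best_idx = i;
      }
    }
    return best_idx;
  }
};
```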

Totally clear; thanks again!