google-deepmind/mctx

Question: what value target should be used in MuZero with Gumbel policy?

hr0nix opened this issue · 7 comments

Hello and thank you for this nice library!

I have a question about using the Gumbel policy in the MuZero algorithm. What target should I use for the value prediction in this case? If I use the value backpropagated to the root, as in the original MuZero paper, that value will not correspond to the improved policy. The paper wasn't clear on this.

Thanks in advance!

To train the value network, the original MuZero uses "a sample return: either the final reward (board games) or n-step return (Atari)."
If you instead want to use a value from the search tree, consider using the Q-value of the selected action:
https://github.com/deepmind/mctx/blob/main/examples/visualization_demo.py#L197
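
For example, something along these lines (a rough sketch, not tested; it assumes the mctx `PolicyOutput` fields `action` and `search_tree`, and the root `qvalues` exposed by `Tree.summary()`; `selected_action_qvalue` is just an illustrative helper name):

```python
import jax.numpy as jnp
import mctx

def selected_action_qvalue(policy_output: mctx.PolicyOutput) -> jnp.ndarray:
  """Q-value of the selected action at the root, per batch element."""
  summary = policy_output.search_tree.summary()  # root statistics of the search
  qvalues = summary.qvalues                      # shape [batch, num_actions]
  action = policy_output.action                  # shape [batch]
  return jnp.take_along_axis(qvalues, action[:, None], axis=-1)[:, 0]
```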

Thanks for a quick response!

> either the final reward (board games) or n-step return (Atari)

This problem actually arises with n-step returns too, since you need an estimate of the value of the last state to bootstrap from.
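
Concretely, the n-step target I mean looks roughly like this (a quick sketch; `rewards`, `gamma`, and `bootstrap_value` are just illustrative names):

```python
import jax.numpy as jnp

def n_step_return(rewards: jnp.ndarray, bootstrap_value: jnp.ndarray,
                  gamma: float) -> jnp.ndarray:
  """sum_{k=0}^{n-1} gamma^k * r_{t+k} + gamma^n * v(s_{t+n})."""
  n = rewards.shape[0]
  discounted_rewards = jnp.sum(gamma ** jnp.arange(n) * rewards)
  return discounted_rewards + gamma ** n * bootstrap_value
```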

> consider using the Q-value of the selected action

The problem with the approach you propose is that I would have to use a deterministic policy that chooses the best action as the policy target; otherwise the value and policy targets will not be consistent. That is, however, not the root policy the paper proposes.

Thanks for asking.

  1. MuZero bootstraps from the target value network, not from the search tree.
  2. Notice that a value network trained on stochastic targets will converge to the mean of those targets. For example, the mean of the Q-values has a meaning (see the toy sketch below). You do not need a deterministic policy.
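
A toy illustration of point 2 (all numbers made up; a scalar "network" fitted with squared error to sampled Q-value targets ends up at their mean):

```python
import jax
import jax.numpy as jnp

sampled_q_targets = jnp.array([0.0, 1.0, 1.0, 2.0])  # hypothetical Q-value targets

def loss(v):
  return jnp.mean((v - sampled_q_targets) ** 2)

v = jnp.array(0.0)
for _ in range(200):
  v = v - 0.1 * jax.grad(loss)(v)
# v is now close to sampled_q_targets.mean() == 1.0
```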

> MuZero bootstraps from the target value network, not from the search tree.

My bad, you are right. However, MuZero Reanalyze and EfficientZero do bootstrap from the search value.

> a value network trained on stochastic targets will converge to the mean of those targets

That is a fair point. But the Q-value of the most probable action is not a correct Monte Carlo estimate of the expected value under the policy, so it won't converge to the right thing. It could work if I trained on the Q-value of an action sampled from the policy, but I won't have Q-values for the actions that weren't explored during search.

What is the "right" thing you want the value network to converge to? Acting in the environment is done with the selected action. The mean of the Q-values will be the model-based value of the policy used for this acting. So an unrestricted value network trained to model the Q-value of the selected action would converge to the model-based value of the policy used for acting.

Note that I do not recommend using the Q-value of the "most probable action". The selected action is not always the most probable action.
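
As a toy sanity check (all numbers made up): if the training target is the Q-value of the action actually selected for acting, the expected target is the model-based value of that acting policy:

```python
import jax.numpy as jnp

pi_select = jnp.array([0.7, 0.2, 0.1])  # probability of selecting each action for acting
qvalues = jnp.array([1.0, 0.5, -0.2])   # root Q-values from the search

# Expected training target = sum_a pi_select(a) * Q(s, a),
# i.e. the model-based value of the acting policy.
expected_target = jnp.sum(pi_select * qvalues)
```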

My bad again, I should have read your response more carefully 🤦‍♂️ This would indeed be a correct estimator of the policy value, provided the Q-values are estimated correctly.

I guess one problem with this approach might be that the Q-values of actions not visited during search will not be approximated well, but I'm not sure whether this matters in practice.