google-deepmind/scalable_agent

Confusing use of bootstrap_value

Serious-Joker opened this issue · 1 comments

Hello! Thanks for posting the code!
I am confused about why the bootstrap_value is used like it is.

In experiment.py, at line 354, bootstrap_value is set to the last element of learner_outputs.baseline:

bootstrap_value = learner_outputs.baseline[-1]

Then, further down at line 378, the entire learner_outputs.baseline is passed to the V-trace algorithm:

vtrace_returns = vtrace.from_logits(
    behaviour_policy_logits=agent_outputs.policy_logits,
    target_policy_logits=learner_outputs.policy_logits,
    actions=agent_outputs.action,
    discounts=discounts,
    rewards=clipped_rewards,
    values=learner_outputs.baseline,   # <-- here
    bootstrap_value=bootstrap_value)

In the V-trace implementation, bootstrap_value is used in only two places: to build values_t_plus_1 and vs_t_plus_1.
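
For reference, here is a minimal sketch of how bootstrap_value enters that computation (simplified from vtrace.from_importance_weights; the clipped importance ratios and the backward recursion that produces the vs targets are left out):

import tensorflow as tf

def v_trace_targets_sketch(values, bootstrap_value, rewards, discounts):
  """Sketch only, not the repo's exact code.

  values: [T, B] value estimates V(x_0) ... V(x_{T-1}).
  bootstrap_value: [B] value estimate V(x_T), one step past the unroll.
  rewards, discounts: [T, B].
  """
  # V(x_{t+1}) for every t: shift the values left by one step and append
  # the bootstrap value at the end.
  values_t_plus_1 = tf.concat(
      [values[1:], tf.expand_dims(bootstrap_value, 0)], axis=0)
  # One-step TD errors delta_t = r_t + gamma_t * V(x_{t+1}) - V(x_t);
  # the real implementation additionally weights these by clipped rhos.
  deltas = rewards + discounts * values_t_plus_1 - values
  return values_t_plus_1, deltas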

I'm confused about why bootstrap_value is handled this way. Why not pass in the values as they are and use the last element when necessary?

Also to be clear, does this mean that values_t_plus_1 is approximated by tf.concat([values[1:], tf.expand_dims(values[-1], 0)], axis=0)?

The separation is mostly for historical reasons; you could indeed avoid having a separate bootstrap_value and just pass it along with the values/baseline.
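
To illustrate what that would mean (a hypothetical alternative interface, not the repository's actual API), the caller would pass the baseline with the extra time step included and the split would happen inside the function:

def split_values_and_bootstrap(baseline_with_bootstrap):
  # Hypothetical helper, not part of the repo: take a [T+1, B] baseline and
  # return the [T, B] values plus the [B] bootstrap value that
  # vtrace.from_logits currently takes as two separate arguments.
  values = baseline_with_bootstrap[:-1]           # V(x_0) ... V(x_{T-1})
  bootstrap_value = baseline_with_bootstrap[-1]   # V(x_T)
  return values, bootstrap_value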

Also to be clear, does this mean that values_t_plus_1 is approximated by tf.concat([values[1:], tf.expand_dims(values[-1], 0)], axis=0)?

No, the values/baseline passed to V-trace do not include the bootstrap_value; it is removed here: https://github.com/deepmind/scalable_agent/blob/master/experiment.py#L363
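
In other words, the code around that line splits the [T+1]-long baseline into the [T]-long values and a separate bootstrap value, roughly like this toy example (the actual code applies the slicing to whole namedtuples via nest.map_structure rather than to a single tensor):

import tensorflow as tf

T, B = 4, 2  # unroll length and batch size for this toy example
baseline = tf.random.normal([T + 1, B])  # stand-in for learner_outputs.baseline

bootstrap_value = baseline[-1]  # shape [B]: the extra final value estimate
values = baseline[:-1]          # shape [T, B]: what actually reaches V-trace

# `values` and `bootstrap_value` together cover the T+1 value estimates the
# V-trace recursion needs, so the bootstrap step is neither duplicated nor lost.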