Confusing use of bootstrap_value
Serious-Joker opened this issue · 1 comments
Hello! Thanks for posting the code!
I am confused about why `bootstrap_value` is used the way it is.
In experiment.py, on line 354, `bootstrap_value` is set to the last element of `learner_outputs.baseline`:
```python
bootstrap_value = learner_outputs.baseline[-1]
```
Then, further down on line 378, the entire `learner_outputs.baseline` is passed to the V-trace algorithm:
```python
vtrace_returns = vtrace.from_logits(
    behaviour_policy_logits=agent_outputs.policy_logits,
    target_policy_logits=learner_outputs.policy_logits,
    actions=agent_outputs.action,
    discounts=discounts,
    rewards=clipped_rewards,
    values=learner_outputs.baseline,  # <-- here
    bootstrap_value=bootstrap_value)
```
In the implemented V-trace algorithm, `bootstrap_value` is used only twice: to create `values_t_plus_1` and `vs_t_plus_1`.
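To make the two uses concrete, here is a minimal sketch of how `bootstrap_value` produces the shifted value sequence inside V-trace, with NumPy standing in for TensorFlow (the function name `shifted_values` and the toy shapes are my own, for illustration only):

```python
import numpy as np

def shifted_values(values, bootstrap_value):
    # values has shape [T, B]: one value estimate per timestep per batch.
    # bootstrap_value has shape [B]: the value estimate for the state
    # *after* the last transition in the unroll.
    # values_t_plus_1[t] = V(x_{t+1}); the final slot is filled by the
    # bootstrap value rather than by reusing values[-1].
    return np.concatenate([values[1:], bootstrap_value[None, :]], axis=0)

values = np.arange(6.0).reshape(3, 2)   # T=3, B=2
bootstrap = np.array([10.0, 20.0])
print(shifted_values(values, bootstrap))
```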
I'm confused why `bootstrap_value` is used in this way. Why not pass in the values as-is and use the last element when necessary?
Also, to be clear: does this mean that `values_t_plus_1` is approximated by `tf.concat([values[1:], tf.expand_dims(values[-1], 0)], axis=0)`?
The separation is mostly for historical reasons; you could indeed avoid having a separate `bootstrap_value` and just pass it in with the values/baseline.
> Also to be clear, does this mean that `values_t_plus_1` is approximated by `tf.concat([values[1:], tf.expand_dims(values[-1], 0)], axis=0)`?
No, the values/baseline passed to V-trace do not include the `bootstrap_value`. It is removed here: https://github.com/deepmind/scalable_agent/blob/master/experiment.py#L363
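A small sketch of the slicing the answer refers to, assuming the learner unrolls T+1 steps and splits the baseline before calling V-trace (toy NumPy values, not the actual experiment.py code):

```python
import numpy as np

# The unroll produces T+1 baseline estimates.
baseline = np.array([1.0, 2.0, 3.0, 4.0])   # length T+1 = 4

bootstrap_value = baseline[-1]   # last estimate becomes the bootstrap
values = baseline[:-1]           # the remaining T values go to V-trace

# Inside V-trace, the shifted sequence is then simply the original
# baseline offset by one step:
values_t_plus_1 = np.concatenate([values[1:], [bootstrap_value]])
print(values_t_plus_1)           # equals baseline[1:]
```

So `values_t_plus_1` is not an approximation that duplicates `values[-1]`; with the split done this way it is exactly `baseline[1:]`.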