Denys88/rl_games

value_bootstrap correctness

alex-petrenko opened this issue · 4 comments

Value bootstrap is calculated here:

shaped_rewards += self.gamma * res_dict['values'] * self.cast_obs(infos['time_outs']).unsqueeze(1).float()

Essentially, what the code does is:

  1. a(t) = actor(obs(t))
  2. v(t) = critic(obs(t))
  3. obs(t+1), rew(t), is_timeout(t) = env.step(a(t))
  4. rew(t) += gamma * v(t) * is_timeout(t) (1)

where t is the index of the timestep within the episode, i.e. the slot we populate in self.experience_buffer (hope my notation is clear)
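For context, here is roughly how that line sits inside the rollout loop, written as a standalone sketch (the function and its arguments are stand-ins for the rl_games internals, not the actual code):

```python
import torch

def rollout_step(actor, critic, env, experience_buffer, t, obs, gamma=0.99):
    """One rollout step with the current bootstrap, equation (1).
    actor/critic/env/experience_buffer are hypothetical stand-ins."""
    with torch.no_grad():
        actions = actor(obs)    # a(t) = actor(obs(t))
        values = critic(obs)    # v(t) = critic(obs(t)), shape [num_envs, 1]

    # obs(t+1), rew(t), done(t), and infos['time_outs'] = is_timeout(t)
    next_obs, rewards, dones, infos = env.step(actions)

    shaped_rewards = rewards.unsqueeze(1).float()   # shape [num_envs, 1]
    if 'time_outs' in infos:
        # equation (1): rew(t) += gamma * v(t) * is_timeout(t)
        shaped_rewards = shaped_rewards + gamma * values * infos['time_outs'].unsqueeze(1).float()

    experience_buffer.update_data('rewards', t, shaped_rewards)
    return next_obs, dones
```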

The idea here is that we should add the estimated return for the rest of the episode, as if the episode were infinitely long.
So, ideally,

rew(t) += gamma * v(t+1) (2)

instead of rew(t) += gamma * v(t) as in (1). Using (1) is undesirable because v(t) already accounts for rew(t), so if the environment returns a large reward on the last step it gets counted twice.

The thing is that we can't really get v(t+1) = critic(obs(t+1)) because if is_timeout(t) is true, done(t) will also be true, which means obs(t+1) corresponds to the next episode.

We can't estimate v(t+1) from v(t) either: the Bellman identity v(t) = rew(t) + gamma * v(t+1) gives gamma * v(t+1) = v(t) - rew(t).
Substituting this into equation (2) above, we get:

rew(t) += v(t) - rew(t) (3)

This just sets rew(t) to v(t), which entirely discards rew(t).

Basically this leaves us with just two options for value bootstrap (see the sketch right after this list):

  1. add rew(t) essentially twice as currently done in the codebase (equation (1))
  2. ignore rew(t) entirely and use just the v(t) estimate for the last step (equation (3))
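To make the two alternatives concrete, here is how each would look in code (a simplified sketch with assumed tensor shapes, not a patch against the actual rl_games code):

```python
# Assumed shapes for this sketch:
#   values         -- v(t), shape [num_envs, 1]
#   shaped_rewards -- rew(t), shape [num_envs, 1]
#   timeouts       -- is_timeout(t) as a float mask, shape [num_envs, 1]

# Option 1 (current behaviour, equation (1)): keep rew(t) and add gamma * v(t) on top.
# Since v(t) already includes rew(t), the last-step reward is effectively counted twice.
shaped_rewards_opt1 = shaped_rewards + gamma * values * timeouts

# Option 2 (equation (3)): on timeout steps, replace rew(t) with v(t) entirely,
# discarding whatever reward the env actually returned on the last step.
shaped_rewards_opt2 = shaped_rewards * (1.0 - timeouts) + values * timeouts
```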

I feel like both options are really hacky and I wonder if there's even a right way to do it. What do you think? Am I missing something here?

On the other hand, both of these options are viable as long as rew(t) on the last step of the episode is negligible. If it is not, i.e. if the environment returns some non-trivial reward when is_timeout(t) is true, both options lead to incorrect learning behavior.
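A tiny numeric illustration of that failure mode (the numbers are made up): say gamma = 0.99, the env returns rew(t) = +10 on the timeout step, and the critic has converged so that v(t) ≈ 10 because the return at step t is dominated by that terminal reward.

```python
gamma = 0.99
rew_t = 10.0   # large terminal reward at the timeout step (made-up number)
v_t = 10.0     # converged critic estimate at obs(t); it already includes rew(t)

target_option_1 = rew_t + gamma * v_t   # equation (1): ~19.9, terminal reward counted ~twice
target_option_2 = v_t                   # equation (3): 10.0, the actual reward never enters the target
print(target_option_1, target_option_2)
```

The target we actually want is rew(t) + gamma * v(t+1), which neither option produces.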

I will test option #2.
The first option was used in some Google/DeepMind code, as far as I remember.
Variant 2 sounds less hacky to me. It might make almost no difference for some IG (Isaac Gym) envs because the reward function is pretty smooth and there are no big jumps.

I guess this is more of a theoretical issue than a practical one.
IG envs indeed have a smooth reward signal, and ignoring the last-step reward or adding it twice isn't going to change anything in an episode with thousands of steps.

But the current approach will break if the environment provides a large terminal reward at timeout (e.g. +10 when the task is solved, -10 when it is not).

Anssi Kanervisto pointed me to this: https://github.com/DLR-RM/stable-baselines3/pull/658/files#diff-384b5f21f2bed58d1d6e64da04a42fee52f353fcec38bf410338524336657bd8R194-R205
They have access to the special "terminal_observation", so that solves the problem correctly I guess :)
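For reference, the linked SB3 logic boils down to roughly the following (a sketch, not a verbatim excerpt; the "TimeLimit.truncated" and "terminal_observation" keys follow the gym TimeLimit / SB3 convention, and `critic` is a stand-in for the value network):

```python
import torch

def bootstrap_from_terminal_obs(rewards, dones, infos, critic, gamma):
    """Equation (2): rew(t) += gamma * v(t+1), computed from the terminal observation."""
    for idx, done in enumerate(dones):
        if done and infos[idx].get("TimeLimit.truncated", False):
            terminal_obs = infos[idx]["terminal_observation"]  # obs(t+1) before the env reset
            with torch.no_grad():
                terminal_value = critic(terminal_obs)          # v(t+1) = critic(obs(t+1))
            rewards[idx] += gamma * terminal_value
    return rewards
```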

My expectation was that if something is solved or failed at the end, is_timeout should be False. It should be True only when everything seems fine except that it is the last step of the episode :)
The terminal observation might be a good solution too. But we have thousands of envs and the main goal is performance, so this change may introduce a noticeable perf regression. But we definitely need to evaluate this option.

Alright, thank you for the feedback! I agree that if something major happens in the last frame of the episode, we should not use value_bootstrap.

For now, maybe just adding a warning that tells the user about this would be sufficient.
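For example, something along these lines when the flag is enabled (a sketch; the exact wording and placement are up for discussion, and `config` is a stand-in for the training config dict with the existing value_bootstrap flag):

```python
import warnings

if config.get('value_bootstrap', False):
    warnings.warn(
        "value_bootstrap adds gamma * v(t) to the reward on timeout steps and assumes "
        "the reward returned there is negligible; if the env gives a large terminal "
        "reward at timeout, the bootstrapped return will be biased."
    )
```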