vwxyzjn/cleanrl

Correct handling of `termination` vs `truncation`?


Hi, thank you so much for the CleanRL resource!

I have a question about the PPO implementation and how it handles the difference between episodes that ended because they terminated (the task was completed) and episodes that were truncated (they ran out of time).

A comment in the advantage calculation suggests that episodes that are not done are to be bootstrapped from the value function.

At the same time, `truncations` and `terminations` are or'd together, so both cases are treated as the same kind of `done`:

next_done = np.logical_or(terminations, truncations)
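
For context, here is a minimal sketch of the GAE recursion as CleanRL-style PPO scripts compute it (simplified to plain numpy and a single environment; the names mirror the script's arrays, but this is an illustration rather than the actual code). Because `dones` carries the or'd flag above, a truncation zeroes the bootstrap term exactly like a real termination:

```python
import numpy as np

def compute_gae(rewards, values, dones, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    """Simplified GAE sketch (single env, plain numpy).

    Follows CleanRL's convention: dones[t] == 1.0 means obs[t] is the first
    observation after an episode ended, so dones[t + 1] tells us whether
    step t was the last step of its episode.
    """
    num_steps = len(rewards)
    advantages = np.zeros(num_steps)
    lastgaelam = 0.0
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            nextnonterminal = 1.0 - next_done
            nextvalues = next_value
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalues = values[t + 1]
        # When nextnonterminal == 0, the gamma * V(s_{t+1}) bootstrap is
        # dropped, so a truncated step is treated as a genuine episode end.
        delta = rewards[t] + gamma * nextvalues * nextnonterminal - values[t]
        lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
        advantages[t] = lastgaelam
    returns = advantages + values
    return advantages, returns
```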

This seems to go against other findings and implementations, e.g. Time Limits in Reinforcement Learning and Stable-Baselines3.

Is the difference here that you assume we're operating in environments with an actual episode timeout, so that a truncation really does mean failure? In other cases there is no inherent time limit, only the designer's desire for faster task solving, in which case I think it makes sense to handle truncations separately.
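
For comparison, one way to handle truncations separately (roughly the Stable-Baselines3 approach) is to fold `gamma * V(s_final)` into the reward of the truncated step and then let only true terminations cut the bootstrap. The helper below is a hypothetical sketch, not code from either library; it assumes the real last observation of each sub-environment can be recovered (e.g. from the step `infos` of a Gymnasium vector env, where the exact key varies by version) and that `value_fn` is the agent's critic:

```python
import numpy as np

def bootstrap_truncated_rewards(rewards, terminations, truncations,
                                final_observations, value_fn, gamma=0.99):
    """Hypothetical helper: add gamma * V(s_final) to the reward of steps
    that were truncated but not terminated, so the rest of the rollout can
    treat only true terminations as `done`.

    `final_observations[i]` is assumed to be the real last observation of
    env i (e.g. recovered from the step infos); `value_fn` is the critic.
    """
    rewards = np.array(rewards, dtype=np.float64, copy=True)
    for i in range(len(rewards)):
        if truncations[i] and not terminations[i]:
            rewards[i] += gamma * value_fn(final_observations[i])
    # Only genuine terminations zero the value bootstrap downstream.
    next_done = np.asarray(terminations, dtype=np.float64)
    return rewards, next_done
```

With this treatment, a truncated episode still contributes a return estimate that accounts for reward the agent could have earned beyond the time limit, which is what the Time Limits in Reinforcement Learning paper argues for when the timeout is not part of the task itself.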

Have I understood all of this correctly?

I believe this is being fixed here - #448