PPO does not correctly calculate reward on timeout

Question

Opened this issue 3 years ago · 0 comments

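SB3's PPO does not seem to distinguish between done and timeout: it relies only on the done flag when computing GAE returns, so an episode cut off by a time limit is treated like a true terminal state and the value of its final state is not bootstrapped.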
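To make the difference concrete, here is a minimal pure-Python sketch of GAE that treats timeouts separately from true terminations. It is not SB3's actual code: the function name `compute_gae` and the `timeouts` and `next_values` arguments are assumptions made for illustration (the timeout flag could come from something like gym's `TimeLimit.truncated` info key).

```python
def compute_gae(rewards, values, dones, timeouts, next_values,
                gamma=0.99, lam=0.95):
    """Illustrative GAE with timeout handling (hypothetical helper).

    rewards[t], values[t]: reward and V(s_t) at step t
    dones[t]:       True if the episode ended after step t (for any reason)
    timeouts[t]:    True if that end was caused by a time limit
    next_values[t]: V(s_{t+1}); for an ended episode this is the value of
                    the final observation *before* the reset
    """
    n = len(rewards)
    advantages = [0.0] * n
    last_gae = 0.0
    for t in reversed(range(n)):
        if dones[t] and not timeouts[t]:
            # True terminal state: nothing to bootstrap from.
            bootstrap = 0.0
        else:
            # Ongoing episode, or a timeout: the task would have continued,
            # so bootstrap from the value of the next state.
            bootstrap = next_values[t]
        delta = rewards[t] + gamma * bootstrap - values[t]
        # Never carry the advantage across an episode boundary.
        carry = 0.0 if dones[t] else last_gae
        last_gae = delta + gamma * lam * carry
        advantages[t] = last_gae
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With a done-only implementation, the `timeouts` branch is missing, so the return at a timed-out step is just the last reward rather than the reward plus the discounted value of the final state.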

This has been fixed in DLR-RM/stable-baselines3#658, but the fix has not yet been released on PyPI.

Until then, the workaround is to use a TimeFeatureWrapper (cf. zoo).
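For anyone unfamiliar with the idea behind the time feature, a rough sketch: the remaining time is appended to the observation, so the value function can learn that the episode is about to end. The code below is only an illustration in pure Python, not the zoo's implementation; `ToyEnv` and `TimeFeature` are hypothetical names, and for brevity the sketch also enforces the time limit itself.

```python
class ToyEnv:
    """Hypothetical toy environment that never terminates on its own."""

    def reset(self):
        return [0.0]

    def step(self, action):
        # Returns (observation, reward, done, info).
        return [0.0], 1.0, False, {}


class TimeFeature:
    """Illustrative wrapper: appends normalized remaining time to the obs."""

    def __init__(self, env, max_steps):
        self.env = env
        self.max_steps = max_steps
        self.t = 0

    def _augment(self, obs):
        # Feature goes from 1.0 (episode start) down to 0.0 (time limit).
        remaining = 1.0 - self.t / self.max_steps
        return list(obs) + [remaining]

    def reset(self):
        self.t = 0
        return self._augment(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.t += 1
        if self.t >= self.max_steps:
            # Assumption of this sketch: the wrapper also applies the limit.
            done = True
        return self._augment(obs), reward, done, info
```

Since the observation now carries the time left, ending the episode at the limit is predictable from the state, which is why this helps even when the algorithm only sees the done flag.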