PPO does not correctly calculate reward on timeout
dtch1997 commented
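SB3's PPO does not seem to distinguish between a true episode termination and a timeout (time-limit truncation). It relies only on the `done` flag when computing GAE returns, so an episode cut off by a time limit is treated as if it had reached a terminal state and the value of the last state is not bootstrapped.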
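To make the effect concrete, here is a minimal pure-Python sketch of GAE on a single three-step rollout that ends by timeout. It is not SB3's code: `gae_returns`, the constant value estimates, and the correction shown (adding `gamma * V(terminal state)` to the last reward on timeout) are assumptions chosen for illustration.

```python
def gae_returns(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages and returns for one rollout.

    dones[t] is True when the episode ended after step t
    (no distinction between termination and timeout).
    """
    n = len(rewards)
    advantages = [0.0] * n
    gae = 0.0
    for t in reversed(range(n)):
        next_value = last_value if t == n - 1 else values[t + 1]
        non_terminal = 0.0 if dones[t] else 1.0
        # When done is set, the bootstrap term is dropped entirely.
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        gae = delta + gamma * lam * non_terminal * gae
        advantages[t] = gae
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


gamma = 0.99
rewards = [1.0, 1.0, 1.0]
values = [10.0, 10.0, 10.0]      # hypothetical critic estimates
dones = [False, False, True]     # last step hit the time limit
terminal_value = 10.0            # hypothetical V(s) of the truncated state

# Timeout treated as a real terminal state: no bootstrap on the last step.
_, naive_returns = gae_returns(rewards, values, dones, last_value=0.0, gamma=gamma)

# Assumed correction: fold gamma * V(terminal state) into the last reward.
fixed_rewards = rewards[:-1] + [rewards[-1] + gamma * terminal_value]
_, fixed_returns = gae_returns(fixed_rewards, values, dones, last_value=0.0, gamma=gamma)

print(naive_returns[-1], fixed_returns[-1])
```

With these numbers the return target for the last step is much lower when the timeout is treated as a termination, which is the bias described above.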
See discussion here:
DLR-RM/stable-baselines3#651
DLR-RM/stable-baselines3#633
This has been fixed in DLR-RM/stable-baselines3#658
but the fix has not yet been released on PyPI. For now, the workaround is to use a `TimeFeatureWrapper` (cf. zoo).
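For anyone unfamiliar with the idea behind the workaround: a time-feature wrapper appends the remaining time to the observation, so the agent can tell a timeout is coming. The sketch below is a dependency-free toy and not the zoo's implementation. `CountingEnv` and `ToyTimeFeature` are hypothetical names, and the old Gym-style `reset()` / `step()` signature is assumed.

```python
class CountingEnv:
    """Hypothetical toy env that times out after max_steps."""

    def __init__(self, max_steps=4):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        done = self.t >= self.max_steps
        return [float(self.t)], 1.0, done, {}


class ToyTimeFeature:
    """Appends the fraction of remaining time (1 at start, 0 at timeout)."""

    def __init__(self, env, max_steps):
        self.env = env
        self.max_steps = max_steps
        self.t = 0

    def _augment(self, obs):
        remaining = max(1.0 - self.t / self.max_steps, 0.0)
        return list(obs) + [remaining]

    def reset(self):
        self.t = 0
        return self._augment(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.t += 1
        return self._augment(obs), reward, done, info


env = ToyTimeFeature(CountingEnv(max_steps=4), max_steps=4)
first_obs = env.reset()
done = False
while not done:
    last_obs, reward, done, info = env.step(action=0)

print(first_obs, last_obs)
```

The extra feature does not fix the missing bootstrap itself. It only lets the value function learn that little reward remains near the time limit.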