mcx-lab/rl-baselines3-zoo

PPO does not correctly calculate reward on timeout


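SB3's PPO does not seem to distinguish between a true termination (`done`) and a time-limit truncation (timeout): it relies only on the `done` flag when computing GAE returns. An episode cut off by a time limit is therefore treated as if it had really ended, and the value of the final state is not bootstrapped.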
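To make the effect concrete, here is a minimal, dependency-free sketch of GAE on a two-step rollout that ends in a timeout. It is not SB3's code: `compute_gae` and `bootstrap_timeouts` are hypothetical helper names, and the numbers are made up for illustration. The second helper shows one way to handle a timeout, assuming the critic's estimate for the final state is available: add the discounted value of that state to the last reward before running GAE.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Plain GAE that only looks at `dones` (hypothetical helper, not SB3's API).

    values[t] is V(s_t); dones[t] is True if the episode ended after step t.
    """
    n = len(rewards)
    advantages = [0.0] * n
    gae = 0.0
    for t in reversed(range(n)):
        next_value = last_value if t == n - 1 else values[t + 1]
        non_terminal = 0.0 if dones[t] else 1.0
        # With done=True the next-state value is dropped, even on a timeout.
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        gae = delta + gamma * lam * non_terminal * gae
        advantages[t] = gae
    return advantages


def bootstrap_timeouts(rewards, timeouts, terminal_values, gamma=0.99):
    """Add gamma * V(final state) to the reward of every step that timed out."""
    return [
        r + gamma * v if timed_out else r
        for r, timed_out, v in zip(rewards, timeouts, terminal_values)
    ]


# Illustrative rollout: the episode is cut by a time limit after step 1.
rewards = [1.0, 1.0]
values = [0.5, 0.5]
dones = [False, True]
timeouts = [False, True]
terminal_values = [0.0, 2.0]  # assumed critic estimate for the final state

naive = compute_gae(rewards, values, dones, 0.0, gamma=0.5, lam=1.0)
fixed_rewards = bootstrap_timeouts(rewards, timeouts, terminal_values, gamma=0.5)
fixed = compute_gae(fixed_rewards, values, dones, 0.0, gamma=0.5, lam=1.0)
print(naive, fixed)
```

In this toy case the naive advantages underestimate both steps, because the value of the state reached at the timeout is silently treated as zero.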

See discussion here:
DLR-RM/stable-baselines3#651
DLR-RM/stable-baselines3#633

This has been fixed in DLR-RM/stable-baselines3#658, but the fix has not yet been released on PyPI.

Until then, the workaround is to use a TimeFeatureWrapper (cf. zoo).
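For readers unfamiliar with the idea, here is a rough sketch of what a time-feature wrapper does: it appends the fraction of the episode that remains to every observation, so the agent can see a timeout coming. This is a simplified stand-in, not the zoo's implementation. `CountingEnv` and `SimpleTimeFeature` are hypothetical names, the environment uses the old 4-tuple `step` convention, and `max_steps` is assumed to match the environment's time limit.

```python
class CountingEnv:
    """Hypothetical toy environment: the observation is [step count], reward is 1."""

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        return [float(self.t)], 1.0, False, {}


class SimpleTimeFeature:
    """Append the remaining-time fraction (1.0 at reset, 0.0 at the limit)."""

    def __init__(self, env, max_steps):
        self.env = env
        self.max_steps = max_steps  # assumed to equal the env's time limit
        self._current_step = 0

    def _augment(self, obs):
        remaining = 1.0 - self._current_step / self.max_steps
        return list(obs) + [remaining]

    def reset(self):
        self._current_step = 0
        return self._augment(self.env.reset())

    def step(self, action):
        self._current_step += 1
        obs, reward, done, info = self.env.step(action)
        return self._augment(obs), reward, done, info


env = SimpleTimeFeature(CountingEnv(), max_steps=4)
first_obs = env.reset()
next_obs, _, _, _ = env.step(0)
print(first_obs, next_obs)
```

With the time feature in the observation, the value function can learn that little return remains near the limit, which reduces the harm of treating a timeout as a termination.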