zcaicaros/L2D

Confused about PPO update

I'm a bit confused about the PPO update process. At line 110:
[Screenshot of line 110, dated 2024-06-06]
The rewards within a single episode are normalized by subtracting the mean and dividing by the variance. Why should the rewards be scaled at all? I found that, even though they are normalized, some genuinely bad rewards get rescaled and important information is lost.
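For context, this is the kind of per-episode normalization I mean (a minimal sketch of the common pattern, not the repo's actual code; the function name, the `eps` term, and the sample numbers are my own, and I use the standard deviation here, which is what most PPO implementations divide by):

```python
import numpy as np

def normalize_rewards(rewards, eps=1e-8):
    # Per-episode reward normalization as commonly done in PPO code:
    # subtract the episode mean, divide by the standard deviation.
    # eps guards against division by zero for constant-reward episodes.
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A single very bad reward keeps its relative rank, but its absolute
# magnitude is compressed to the episode's own scale:
episode = [1.0, 1.0, 1.0, -100.0]
print(normalize_rewards(episode))
```

This is what I mean by "information is lost": after normalization every episode has mean 0 and standard deviation 1, so the -100 above is no longer distinguishable in magnitude from a mildly bad reward in some other episode.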