Khrylx/PyTorch-RL

Confusion about advantage computation


Hey!
I'm a bit confused by the code that computes advantages. Why is the previous advantage set to the first env's advantage from the previous time step, i.e. `advantages[i, 0]` (assuming the advantages have shape (time_steps, num_envs, 1))? The previous value is set the same way:

```python
prev_value = values[i, 0]
```
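To make the question concrete, here is a minimal sketch of the backward recursion under the shape assumption above, (time_steps, num_envs). It keeps one running value and one running advantage *per environment* rather than reading index `[i, 0]`. The function name `estimate_advantages_multi_env`, the use of plain Python lists, and the mask convention (0.0 at the step where an episode ends, 1.0 otherwise) are all hypothetical choices for illustration, not the repository's code. If the buffers are actually shaped (time_steps, 1), `[i, 0]` simply extracts the single scalar and this sketch reduces to the same computation.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # indexed as [time_step][env]


def estimate_advantages_multi_env(
    rewards: Matrix,   # r_t for each env, shape (T, num_envs)
    masks: Matrix,     # hypothetical convention: 0.0 where an episode ends at t, else 1.0
    values: Matrix,    # critic estimates V(s_t), shape (T, num_envs)
    gamma: float,      # discount factor
    tau: float,        # GAE lambda (name assumed for illustration)
) -> Tuple[Matrix, Matrix]:
    T = len(rewards)
    num_envs = len(rewards[0])
    advantages = [[0.0] * num_envs for _ in range(T)]

    # One running "next value" and "next advantage" per env, instead of the
    # single scalar that indexing with [i, 0] would give.
    prev_value = [0.0] * num_envs
    prev_advantage = [0.0] * num_envs

    # Walk backwards in time: "previous" in loop order means step i + 1.
    for i in reversed(range(T)):
        for e in range(num_envs):
            # TD error; the mask stops bootstrapping across an episode boundary.
            delta = rewards[i][e] + gamma * prev_value[e] * masks[i][e] - values[i][e]
            advantages[i][e] = delta + gamma * tau * prev_advantage[e] * masks[i][e]
            prev_value[e] = values[i][e]
            prev_advantage[e] = advantages[i][e]

    # Value targets: advantage plus baseline.
    returns = [[v + a for v, a in zip(v_row, a_row)]
               for v_row, a_row in zip(values, advantages)]
    return advantages, returns
```

With two envs whose rewards differ, env 1 at step 0 bootstraps from its own advantage at step 1 (2.0), giving 2.0 + 0.5 * 2.0 = 3.0. Sharing env 0's running advantage would instead give 2.5.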

Could you link the source for the equations used in this function?
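For reference, the loop's structure (a TD error followed by a discounted running sum of TD errors) looks like Generalized Advantage Estimation (GAE), from Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation". That identification is an assumption based on the shape of the recursion. It is not confirmed here. Writing $m_t$ for the episode mask (0 at a terminal step) and $\lambda$ for the parameter the code would pass as something like `tau` (name assumed), the recursion would be:

$$
\begin{aligned}
\delta_t &= r_t + \gamma\, m_t\, V(s_{t+1}) - V(s_t), \\
\hat{A}_t &= \delta_t + \gamma \lambda\, m_t\, \hat{A}_{t+1}, \\
\hat{R}_t &= \hat{A}_t + V(s_t).
\end{aligned}
$$

Within a single episode this unrolls to $\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l\, \delta_{t+l}$. Each quantity on the right refers to the same trajectory (the same env), which is why the question about `[i, 0]` matters when several envs are stacked.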
Thanks!
Gunshi