standardize the advantage function
Closed · 1 comment
onlytailei commented
In `models.py`:
```python
# Standardize the advantage function to have mean=0 and std=1
advants_n = np.concatenate([path["advants"] for path in paths])
# advants_n -= advants_n.mean()
advants_n /= (advants_n.std() + 1e-8)
```
Is it a typo that you commented out `advants_n -= advants_n.mean()`? You said the mean of the advantage function should be 0.
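To make the question concrete, here is a minimal sketch that runs both variants on made-up data. The `paths` list and the `standardize` helper are hypothetical stand-ins, not code from the repository. With the mean subtraction enabled, the result has mean 0 and std 1. With it commented out, as in the excerpt, only the std is 1, and every advantage keeps its original sign.

```python
import numpy as np

def standardize(advants, subtract_mean=True, eps=1e-8):
    """Rescale advantages to unit std; optionally center them to zero mean first."""
    a = np.asarray(advants, dtype=np.float64).copy()
    if subtract_mean:
        a -= a.mean()          # the line that is commented out in models.py
    a /= (a.std() + eps)       # the line that is kept
    return a

# Hypothetical rollout data shaped like `paths` in the excerpt:
# a list of dicts, each holding a per-timestep "advants" array.
paths = [
    {"advants": np.array([1.0, 2.0, 3.0])},
    {"advants": np.array([4.0, 5.0])},
]
advants_n = np.concatenate([path["advants"] for path in paths])

full = standardize(advants_n, subtract_mean=True)         # mean 0, std 1
scale_only = standardize(advants_n, subtract_mean=False)  # std 1, mean not 0

print("full:      ", full.mean(), full.std())
print("scale only:", scale_only.mean(), scale_only.std())
```

In the scale-only case the mean is simply the original mean divided by the std, so the comment "mean=0 and std=1" describes only the fully standardized variant.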
YunzhuLi commented
I found this normalization in John Schulman's TRPO implementation. It helps stabilize the gradients during backpropagation. However, this normalization introduces bias, which is why I did not subtract the mean.
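A minimal sketch, on a hypothetical toy batch, of what dividing by the std alone does and does not change in a score-function policy-gradient estimate `mean_t A_t * grad log pi(a_t | s_t)`. Scaling every advantage by the same positive constant only rescales the estimate, so its direction is unchanged. Subtracting the batch mean changes the per-timestep weights (here the advantage 1.0 becomes negative), so the direction changes as well. The arrays `advants_n` and `grad_logp` are made up for illustration, and the sketch does not by itself quantify the bias discussed above.

```python
import numpy as np

# Hypothetical batch: 3 timesteps, each with an advantage estimate and a
# 2-dimensional gradient of log pi(a_t | s_t) w.r.t. the policy parameters.
advants_n = np.array([3.0, 1.0, 2.0])
grad_logp = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 1.0]])

def pg_estimate(advants):
    """Score-function estimate: average of A_t * grad log pi_t over the batch."""
    return (advants[:, None] * grad_logp).mean(axis=0)

def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

eps = 1e-8
raw = pg_estimate(advants_n)
scaled = pg_estimate(advants_n / (advants_n.std() + eps))          # as in models.py
centered = pg_estimate((advants_n - advants_n.mean()) / (advants_n.std() + eps))

print(cosine(raw, scaled))    # dividing by a positive scalar keeps the direction
print(cosine(raw, centered))  # subtracting the batch mean changes it
```

Only the step size differs between `raw` and `scaled`, which an optimizer's learning rate (or TRPO's step-size constraint) can absorb.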