Possible bug
lepmik opened this issue · 4 comments
Hi, first of all thank you for your blog post and the nice, readable code! I have been using your example to rewrite another RL implementation from TF1 to TF2. When I was comparing advantages, though, I found some differences. It seems to me that the `returns` line in your advantage estimation, `returns = np.append(np.zeros_like(rewards), next_value, axis=-1)`, should be `returns = np.append(values, next_value, axis=-1)`.
Hello,
The `returns` array is initialized on that line and calculated in the subsequent lines. Each value in that array represents the cumulative sum of rewards from its time step until the end, plus the bootstrapped `next_value`.
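In pseudocode the loop boils down to something like this (just a sketch for clarity; `gamma`, `dones`, and the function name are placeholders rather than the exact blog post code, and `next_value` is assumed to be a length-1 array):

```python
import numpy as np

def bootstrapped_returns(rewards, dones, next_value, gamma=0.99):
    # Reserve one extra slot at the end for the bootstrap value.
    returns = np.append(np.zeros_like(rewards), next_value, axis=-1)
    for t in reversed(range(rewards.shape[0])):
        # Discounted sum of rewards from step t onward plus the bootstrapped
        # next_value; (1 - dones[t]) cuts the sum off at episode boundaries.
        returns[t] = rewards[t] + gamma * returns[t + 1] * (1 - dones[t])
    return returns[:-1]  # drop the bootstrap slot

rewards = np.array([0., 0., 1., 1., 0., 1.], dtype=np.float32)
dones = np.zeros_like(rewards)
returns = bootstrapped_returns(rewards, dones, np.array([0.5], dtype=np.float32))
```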
`np.zeros_like` uses the input array's dtype by default. In the blog post code `rewards` has dtype `np.float32`, but in your case it's `np.int32`. This results in the `gamma * returns[t+1]` part being truncated to zero, effectively making the `returns` array a copy of `rewards`.
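A tiny standalone illustration of the dtype behaviour (plain NumPy, not the blog post code):

```python
import numpy as np

gamma = 0.99
rewards = np.array([0, 0, 1, 1, 0, 1], dtype=np.int32)  # integer rewards, as in the reported case

returns = np.zeros_like(rewards)  # inherits np.int32 from rewards
print(returns.dtype)              # int32

# Storing a float into an int32 array silently drops the fractional part,
# so the discounted bootstrap term vanishes:
returns[-1] = rewards[-1]                        # suppose the last return is 1
returns[-2] = rewards[-2] + gamma * returns[-1]  # 0 + 0.99, stored as 0
print(returns[-2])                               # 0 instead of 0.99

# With a float array the same update keeps the discounted term:
returns_f = np.zeros_like(rewards.astype(np.float32))
returns_f[-1] = rewards[-1]
returns_f[-2] = rewards[-2] + gamma * returns_f[-1]
print(returns_f[-2])                             # 0.99
```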
To "fix" the issue either use floats or explicitly specify the dtype when defining the rewards array, e.g. rewards = np.array([0., 0., 1., 1., 0., 1.])
.
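For example, either of these should do:

```python
import numpy as np

# Float literals make NumPy infer a float dtype...
rewards = np.array([0., 0., 1., 1., 0., 1.])

# ...or keep integer literals and set the dtype explicitly.
rewards = np.array([0, 0, 1, 1, 0, 1], dtype=np.float32)
```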