inoryy/tensorflow2-deep-reinforcement-learning

Possible bug

lepmik opened this issue · 4 comments

Hi, first of all, thank you for your blog post and the nice, readable code! I have been using your example to rewrite another RL implementation from TF1 to TF2. When I compared the advantages, though, I found some differences. It seems to me that this line in your advantage estimation:

returns = np.append(np.zeros_like(rewards), next_value, axis=-1)
should be

returns = np.append(values, next_value, axis=-1)

Hello,
The returns array is only initialized on that line; it is filled in by the loop on the subsequent lines. Each value in that array represents the discounted cumulative sum of rewards from its timestep until the end of the batch, plus the bootstrapped next_value.
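For context, here is a minimal sketch of that computation (the names gamma, dones, and returns_advantages are illustrative and may differ slightly from the actual helper in the repo):

import numpy as np

def returns_advantages(rewards, dones, values, next_value, gamma=0.99):
    # Allocate one extra slot and seed it with the bootstrap value.
    returns = np.append(np.zeros_like(rewards), next_value, axis=-1)
    # Walk backwards: each return is the immediate reward plus the
    # discounted return of the next step, cut off at episode ends.
    for t in reversed(range(rewards.shape[0])):
        returns[t] = rewards[t] + gamma * returns[t + 1] * (1 - dones[t])
    returns = returns[:-1]  # drop the bootstrap slot
    # Advantages are the returns minus the value-function baseline.
    advantages = returns - values
    return returns, advantages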

Hi, sorry, I was fooled by what seemed like a NumPy bug in zeros_like and jumped to conclusions without properly checking the code. Anyway, here is the background behind my confusion: as you can see, using zeros_like seems to mess things up.

[screenshot showing the returns computed with np.zeros_like]

np.zeros_like uses the input array's dtype by default. In the blog post code, rewards has dtype np.float32, but in your case it's np.int32. As a result, the gamma * returns[t+1] part is truncated towards zero on assignment, effectively turning the returns array into a copy of rewards.

To "fix" the issue either use floats or explicitly specify the dtype when defining the rewards array, e.g. rewards = np.array([0., 0., 1., 1., 0., 1.]).