vwxyzjn/cleanrl

Bug in RND Intrinsic Reward Normalization

akarshkumar0101 opened this issue · 1 comment

Problem Description

CleanRL's RND implementation has the following lines of code:

curiosity_reward_per_env = np.array(
    [discounted_reward.update(reward_per_step) for reward_per_step in curiosity_rewards.cpu().data.numpy().T]
)
mean, std, count = (
    np.mean(curiosity_reward_per_env),
    np.std(curiosity_reward_per_env),
    len(curiosity_reward_per_env),
)
reward_rms.update_from_moments(mean, std**2, count)

curiosity_rewards /= np.sqrt(reward_rms.var)

curiosity_rewards has shape (num_steps, num_envs).
When we loop over curiosity_rewards.cpu().data.numpy().T, we are therefore looping over num_envs, not num_steps.
So we end up computing the intrinsic "return" as a running exponentially discounted sum of rewards over environments, when it should be a running exponentially discounted sum of rewards over timesteps.

In practice this does not matter too much, since we only need an approximate scale of the rewards to normalize them, but it is still wrong.
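To make the orientation concrete, here is a minimal sketch (with made-up toy dimensions, not the actual hyperparameters):

import numpy as np

num_steps, num_envs = 4, 3  # hypothetical toy sizes
curiosity_rewards = np.arange(num_steps * num_envs).reshape(num_steps, num_envs)

# Looping over the transpose yields num_envs vectors of length num_steps,
# so each discounted_reward.update() call would receive one environment's
# entire reward sequence instead of one timestep's rewards across envs.
rows_T = list(curiosity_rewards.T)
print(len(rows_T), rows_T[0].shape)  # 3 (num_envs), (4,) i.e. (num_steps,)

# Looping over the array directly yields num_steps vectors of length num_envs,
# which is what a running discounted sum over timesteps expects.
rows = list(curiosity_rewards)
print(len(rows), rows[0].shape)  # 4 (num_steps), (3,) i.e. (num_envs,)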

In the original implementation of RND, they do rffs_int = np.array([self.I.rff_int.update(rew) for rew in self.I.buf_rews_int.T]), but their buf_rews_int has shape (num_envs, num_steps), so the transpose is correct there: the loop runs over timesteps.
In CleanRL the shape is already (num_steps, num_envs), so transposing makes the loop run over environments.

Another issue is that in the original implementation, the RMS is updated with all the filtered rewards in the batch via self.I.rff_rms_int.update(rffs_int.ravel()), whereas CleanRL updates the RMS with count=len(curiosity_reward_per_env), which is only num_envs.
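For context, the RunningMeanStd tracker used for this normalization (in OpenAI Baselines and CleanRL alike) is typically implemented roughly as sketched below; update(x) just forwards the batch mean, variance, and sample count to update_from_moments, which is why passing count = num_envs instead of feeding all num_envs * num_steps values changes how much weight each rollout gets in the running statistics.

import numpy as np

class RunningMeanStd:
    # Rough sketch of the usual running-moments tracker; shown only to
    # illustrate the role of the batch count in the update.
    def __init__(self, epsilon=1e-4, shape=()):
        self.mean = np.zeros(shape, "float64")
        self.var = np.ones(shape, "float64")
        self.count = epsilon

    def update(self, x):
        # update(x) is just update_from_moments over the whole batch,
        # so reward_rms.update(rewards.flatten()) uses num_steps * num_envs
        # samples per rollout rather than num_envs.
        self.update_from_moments(np.mean(x, axis=0), np.var(x, axis=0), x.shape[0])

    def update_from_moments(self, batch_mean, batch_var, batch_count):
        # Standard parallel (Chan et al.) combination of two sets of moments.
        delta = batch_mean - self.mean
        tot_count = self.count + batch_count
        new_mean = self.mean + delta * batch_count / tot_count
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        m2 = m_a + m_b + delta**2 * self.count * batch_count / tot_count
        self.mean, self.var, self.count = new_mean, m2 / tot_count, tot_count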

Possible Solution

I have fixed both issues in the following possible solution:

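# Iterate over timesteps (no transpose): each update() call now receives one
# (num_envs,) reward vector per step, so the discounted sum runs over time.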
curiosity_reward_per_env = np.array(
    [discounted_reward.update(reward_per_step) for reward_per_step in curiosity_rewards.cpu().data.numpy()]
)
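# Update the RMS with all num_steps * num_envs filtered rewards, matching the
# original ravel()-based update.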
reward_rms.update(curiosity_reward_per_env.flatten())

curiosity_rewards /= np.sqrt(reward_rms.var)
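For reference, discounted_reward is the reward forward filter from the original RND code (roughly as sketched below); with the fix, each update() call receives one (num_envs,) reward vector per timestep, so the running discounted sum is accumulated over time for each environment, as intended.

class RewardForwardFilter:
    # Rough sketch of the filter used in the original RND code (and CleanRL):
    # each update() call adds one timestep's rewards to a running discounted sum.
    def __init__(self, gamma):
        self.rewems = None  # running discounted sum, shape (num_envs,)
        self.gamma = gamma

    def update(self, rews):
        if self.rewems is None:
            self.rewems = rews
        else:
            self.rewems = self.rewems * self.gamma + rews
        return self.rewems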

I also opened up a pull request with these fixes.

@yooceii @vwxyzjn

Hi Akarsh, it's indeed a bug. Can you provide a comparison plot of before and after the fix? (I don't see the PR, though.)
At the same time, can you also rename curiosity_reward_per_env to curiosity_reward_per_step?