hiwonjoon/tf-a3c-gpu

Why clip Rewards?


I was wondering why clipping the rewards improves the performance... the rewards for the Breakout environment (using OpenAI Gym) already seem to be limited to [-1, 1]. Could it be that the performance difference is due to the gradient normalization only?

I also noticed that you use tf.clip_by_average_norm instead of tf.clip_by_global_norm. Have you tried the latter? In the other A3C implementations I have seen, the latter is far more common, which made me wonder whether there is any specific reason to use clip_by_average_norm.

Anyways, congratulations! Great work!

Hi,

  1. Actually, the reward is not clipped by the environment. If you watch the play closely, the blocks on the top rows give a higher score. The main reason for clipping the reward is that a neural network is not good at fitting data that does not have zero mean. The bias term should be able to handle this (perhaps with a higher learning rate?), but it is better to just normalize it. I have seen many cases where simply scaling the reward is the key to getting training to work. (See the first sketch after this list.)

  2. I haven't tried the other version, but I guess it won't make a big difference. I just chose it because that was the function I found in the TensorFlow documentation. (The second sketch below shows how the two calls differ.)
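
Here is a minimal sketch of what the reward clipping could look like in a worker's environment loop. The sign-clipping follows the DQN/A3C papers; the environment id, the random-action stand-in for the policy, and the old four-tuple step API are illustrative assumptions, not code from this repo:

```python
import numpy as np
import gym  # assumes an older Gym version with the 4-tuple step API


def clip_reward(reward):
    # Clip to {-1, 0, +1} as in the DQN/A3C papers: the agent only sees
    # that a brick was hit, not whether it was worth 1, 4, or 7 points.
    return float(np.sign(reward))


env = gym.make('Breakout-v0')  # env id is illustrative
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()              # stand-in for the policy
    obs, raw_reward, done, info = env.step(action)
    reward = clip_reward(raw_reward)                # use this for returns/advantages
```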
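And a minimal sketch of where tf.clip_by_average_norm and tf.clip_by_global_norm would slot into the training op (TF1-style graph API to match the repo's era; the tiny linear model and the clip constants are illustrative assumptions):

```python
import tensorflow as tf

# Toy model so the gradients exist; stands in for the actual A3C network.
x = tf.placeholder(tf.float32, [None, 4])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable(tf.random_normal([4, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

optimizer = tf.train.RMSPropOptimizer(learning_rate=1e-3)
grads, variables = zip(*optimizer.compute_gradients(loss))

# Per-tensor clipping by average L2 norm (the norm divided by the number of
# elements in each gradient tensor) -- the function used in this repo:
clipped = [tf.clip_by_average_norm(g, 0.1) for g in grads]

# The alternative seen in many other A3C implementations: clip all gradients
# jointly by their global norm.
# clipped, _ = tf.clip_by_global_norm(list(grads), 40.0)

train_op = optimizer.apply_gradients(zip(clipped, variables))
```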

Thanks!

Oh my... I didn't know that the blocks on the top give higher scores! That is truly a game changer for me! I've been having a LOT of trouble with Breakout using my code, and I have already compared it with about 7 or 8 other A3C implementations, but somehow I missed the reward clipping in all of them. I had already heard that clipping rewards is important, but I really thought that Gym already did that behind the scenes.

I will try this in my code and see what happens.

I am truly grateful for the help!