marload/DeepRL-TensorFlow2

Reward modification in PPO

Ynjxsjmh opened this issue · 2 comments

```python
state_batch.append(state)
action_batch.append(action)
reward_batch.append(reward * 0.01)
old_policy_batch.append(probs)
```

```python
state_batch.append(state)
action_batch.append(action)
reward_batch.append((reward+8)/8)
old_policy_batch.append(log_old_policy)
```

In PPO_Discrete each reward is multiplied by 0.01, and in PPO_Continuous the reward is also modified, as (reward+8)/8. I don't understand why these modifications are made. What do they do?

same question

Multiplying by 0.01 is probably meant to shrink the reward so that it stays between 0 and 1 (my guess).
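
To make that guess concrete, here is a small sketch of what the two transforms do numerically. It rests on assumptions that the thread does not confirm: that the discrete example runs on a CartPole-style task (reward of 1 per step, episodes capped at 500 steps), that the continuous example runs on a Pendulum-style task (per-step reward between roughly -16.27 and 0), and that the discount factor is 0.99. The helper names are made up for illustration and are not from the repo.

```python
# Illustrative only: environments, reward ranges and gamma are assumptions.

def scale_discrete(reward):
    """The transform from the PPO_Discrete snippet: reward * 0.01."""
    return reward * 0.01


def scale_continuous(reward):
    """The transform from the PPO_Continuous snippet: (reward + 8) / 8."""
    return (reward + 8) / 8


def discounted_return(rewards, gamma):
    """Discounted sum of a reward sequence, computed back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


GAMMA = 0.99                   # assumed discount factor
PENDULUM_MIN_REWARD = -16.2736 # assumed worst per-step Pendulum reward
PENDULUM_MAX_REWARD = 0.0      # assumed best per-step Pendulum reward

# Continuous case: the shift and scale roughly centre the reward around zero.
print("continuous, worst step:", scale_continuous(PENDULUM_MIN_REWARD))
print("continuous, best step: ", scale_continuous(PENDULUM_MAX_REWARD))

# Discrete case: a 500-step episode with reward 1 per step (assumed).
raw_rewards = [1.0] * 500
scaled_rewards = [scale_discrete(r) for r in raw_rewards]
print("discrete, raw return:   ", discounted_return(raw_rewards, GAMMA))
print("discrete, scaled return:", discounted_return(scaled_rewards, GAMMA))
```

Under these assumptions, (reward+8)/8 maps the per-step reward into roughly [-1.03, 1], and the 0.01 factor shrinks the discounted return the critic has to predict by a factor of 100. Smaller targets of this kind are commonly used to keep the value loss and gradients at a moderate scale. Whether that was the author's actual intent is not stated in this thread.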