Rafael1s/Deep-Reinforcement-Learning-Algorithms

Reward shaping not removed in evaluation in CarRacing-From-Pixels-PPO


Hi,

The figure and log in the README show scores above 1000, which, given how CarRacing's reward is designed, is not possible.
It turns out that the reward shaping in Wrapper.step() is not removed during evaluation, and that leads to incorrect (inflated) results.
After commenting out the relevant lines, I got an average score of 820 over 100 episodes.
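
For reference, the reason scores above 1000 are impossible: CarRacing pays +1000/N per visited track tile and -0.1 per frame, so the undiscounted return is bounded above by 1000. A back-of-the-envelope sketch (illustrative numbers, not measurements from this repo):

    # Illustrative only: track length and frame count vary per seed.
    n_tiles = 300                                 # number of track tiles
    frames_used = 900                             # frames spent finishing the lap
    tile_reward = n_tiles * (1000.0 / n_tiles)    # 1000.0 if every tile is visited
    time_penalty = -0.1 * frames_used             # -90.0
    print(tile_reward + time_penalty)             # 910.0 -- strictly below 1000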

Thanks for your note.
Could you point out which lines should be commented out, or share your version of Wrapper.step()?

Hi,
The following snippet is what I used for evaluation; let me know if this makes sense to you :)

    def step(self, action):
        # action_repeat, img_stack, env, rgb2gray and np are assumed to
        # come from the enclosing module.
        total_reward = 0
        for i in range(action_repeat):
            img_rgb, reward, die, _ = env.step(action)
            # Reward shaping disabled for evaluation:
            # don't penalize the "die state"
            # if die:
            #     reward += 100
            # green penalty
            # if np.mean(img_rgb[:, :, 1]) > 185.0:
            #     reward -= 0.05
            total_reward += reward
            # Early termination disabled for evaluation:
            # if no reward recently, end the episode
            # done = True if self.av_r(reward) <= -0.1 else False
            done = False                # never end the episode early in evaluation
            if done or die:
                break
        img_gray = rgb2gray(img_rgb)
        self.stack.pop(0)
        self.stack.append(img_gray)
        assert len(self.stack) == img_stack
        return np.array(self.stack), total_reward, done, die
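
For context, the snippet would be driven by an evaluation loop roughly like the one below; `env_wrapper` and `policy` are hypothetical stand-ins, not the repo's actual classes:

    import numpy as np

    def evaluate(env_wrapper, policy, n_episodes=100):
        # Average the raw (unshaped) environment return over n_episodes.
        scores = []
        for _ in range(n_episodes):
            state = env_wrapper.reset()              # hypothetical reset API
            total, done, die = 0.0, False, False
            while not (done or die):
                action = policy(state)               # hypothetical policy call
                state, reward, done, die = env_wrapper.step(action)
                total += reward
            scores.append(total)
        return float(np.mean(scores))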

Hi,
I cannot agree with your version. For example, where is your "green penalty"? You need to penalize the car for driving onto the green field. Possibly the green threshold should be lower than 185, or the penalty should be tuned more carefully than -0.05.
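
For instance, the threshold could be calibrated empirically (an illustrative sketch; the sample frames are hypothetical, not anything from this repo):

    import numpy as np

    def green_level(frame):
        # Mean of the green channel of a 96x96x3 CarRacing observation;
        # grass-dominated views score noticeably higher than road views.
        return np.mean(frame[:, :, 1])

    # Measure green_level() on frames where the car is clearly on the road
    # and on frames where it is on the grass, then pick the shaping
    # threshold between the two observed values.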

Let's see if the following points describe my position better.

  1. There are two versions of Wrapper.step(): one for training and one for evaluation.
  2. You can add whatever reward shaping you like in the training version, e.g., a penalty for driving onto the grass.
  3. You should not add any reward shaping in the evaluation version. E.g., CarRacing is considered solved when the average reward exceeds 900, but it is not fair to add 100 when die == True, or to end the episode early when you notice the car is not running well, right?

The code snippet I used was for evaluation.
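
One way to keep the training and evaluation behaviors from silently diverging (a hypothetical sketch, not this repo's code) is a single wrapper with a training flag:

    import numpy as np

    class EvalAwareWrapper:
        # Hypothetical: one wrapper, with shaping toggled by a flag, so
        # training and evaluation cannot use different reward logic by accident.
        def __init__(self, env, training=True):
            self.env = env
            self.training = training

        def shape(self, reward, img_rgb, die):
            if not self.training:
                return reward                        # evaluation: raw env reward only
            if die:
                reward += 100                        # undo the terminal -100 penalty
            if np.mean(img_rgb[:, :, 1]) > 185.0:
                reward -= 0.05                       # grass penalty
            return reward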

@lerrytang
Let us look at the OpenAI CarRacing environment code
https://github.com/openai/gym/blob/master/gym/envs/box2d/car_racing.py
lines 337-339

If the car goes outside the field, the reward is penalized by -100.
However, if the track is over (an absolutely successful case), the reward is ALSO penalized by -100.
Then, for fairness, we restore the reward with +100 in Wrapper.step().
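
For reference, the branch being discussed looks roughly like this (paraphrased from gym's car_racing.py; exact line numbers drift between gym versions):

    # Paraphrased from CarRacing.step() in gym; PLAYFIELD is the half-size
    # of the playable area defined in the same file.
    x, y = self.car.hull.position
    if abs(x) > PLAYFIELD or abs(y) > PLAYFIELD:
        done = True
        step_reward = -100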