quantumiracle/Popular-RL-Algorithms

Why does every PPO training run result in the same reward chart? This puzzles me very much.

Closed this issue · 6 comments

Hi,
Please provide more details about the code you used.
Did you run multiple rounds of training within the same run? If so, drawing to exactly the same plt.figure will result in multiple curves on the same plot.
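For illustration, a minimal sketch (the data and file name here are made up) of how reusing the same figure number across runs stacks curves on one plot unless the axes are cleared:

```python
import matplotlib.pyplot as plt

def plot_rewards(all_ep_r):
    # Reusing figure 1 without plt.cla()/plt.clf() keeps the old lines,
    # so a second call draws a second curve on the same axes.
    plt.figure(num=1)
    plt.plot(all_ep_r)
    plt.xlabel('Episode')
    plt.ylabel('Moving averaged episode reward')
    plt.savefig('reward.png')

plot_rewards([0, 1, 2])   # first run: one curve
plot_rewards([2, 1, 0])   # second run: both curves appear in figure 1
```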

```python
def train():
    env = gym.make(ENV_NAME).unwrapped
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]
    drawer = Drawer()
    # reproducible
    #env.seed(RANDOMSEED)
    #np.random.seed(RANDOMSEED)
    #torch.manual_seed(RANDOMSEED)

    ppo = PPO(state_dim, action_dim, method=METHOD)
    global all_ep_r, update_plot, stop_plot, Angle, OPT_ANGLE
    all_ep_r = []
    Angle = []
    OPT_ANGLE = []
    for ep in range(EP_MAX):
        s = env.reset()
        ep_r = 0
        t0 = time.time()
        for t in range(EP_LEN):
            if RENDER:
                env.render()
            a = ppo.choose_action(s)
            ti = time.time()
            s_, S_temp, r, done, _ = env.step(s, a, ti)  # px
            ppo.store_transition(s, a, (r + 8) / 8)  # useful for pendulum since the nets are very small,
            # normalization makes it easier to learn
            s = s_
            ep_r += r
            angle, speed, height = s  # px
            # update ppo
            if len(ppo.state_buffer) == BATCH_SIZE:
                ppo.finish_path(s_, done)
                ppo.update()
            if done:
                break
        ppo.finish_path(s_, done)
        print(
            'Episode: {}/{}  | Episode Reward: {:.4f}  | Running Time: {:.4f}'.format(
                ep + 1, EP_MAX, ep_r,
                time.time() - t0
            )
        )
        if ep == 0:
            all_ep_r.append(ep_r)
        else:
            all_ep_r.append(all_ep_r[-1] * 0.9 + ep_r * 0.1)
            OPT_ANGLE.append(S_temp)  # px
            Angle.append(angle)  # px
        if PLOT_RESULT:
            update_plot.set()

    ppo.save_model()
    if PLOT_RESULT:
        stop_plot.set()
    env.close()
```

After I commented out lines 7 to 9, I found that the curve from each training run is no longer the same. Is it correct for me to modify it like this, please?

I'm afraid the problem is not caused by the code you adopted from this repo. Also, I'm not clear which plotting function you used or which environment you are working on.

I guess you mean s = env.reset() for line 7; the environment reset is standard in RL and should not be removed in general. Maybe you are using a fully deterministic environment without any noise; in that case the learning curve could indeed come out the same if the model uses exactly the same samples for updates throughout the whole learning process. But this looks unlikely to me, because there is some randomness in the sampling inside choose_action. So I would suggest checking the plotting code you used more carefully.
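As a rough illustration (not the exact code from this repo), a stochastic continuous-control policy with a Gaussian head samples a different action on every call even for the same state, which is where that randomness comes from:

```python
import torch
from torch.distributions import Normal

mean = torch.tensor([0.5])   # pretend the policy network produced this mean
std = torch.tensor([0.2])    # and this standard deviation

dist = Normal(mean, std)
print(dist.sample())  # differs between calls/runs unless the RNG is seeded
print(dist.sample())
```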

```python
class Drawer:
    def __init__(self, comments=''):
        global update_plot, stop_plot
        update_plot = threading.Event()
        update_plot.set()
        stop_plot = threading.Event()
        stop_plot.clear()
        self.title = ARG_NAME
        if comments:
            self.title += '_' + comments

    def plot(self):
        plt.ion()
        clear_output(True)  # px1013
        global all_ep_r, update_plot, stop_plot, Angle, OPT_ANGLE
        all_ep_r = []
        Angle = []
        OPT_ANGLE = []
        while not stop_plot.is_set():
            if update_plot.is_set():
                plt.figure(num=1, figsize=(20, 5))
                plt.cla()
                plt.title('Reward')  # px
                plt.plot(all_ep_r)
                # plt.ylim(-2000, 0)
                plt.xlabel('Episode')
                plt.ylabel('Moving averaged episode reward')
                plt.savefig(os.path.join('fig', 'Morphing reward_' + time_str))
                plt.figure(num=2, figsize=(20, 5))
                plt.cla()
                plt.title('Angle')
                x = list(range(0, len(Angle)))
                plot1 = plt.plot(Angle, 'r-', label='angle')
                plot2 = plt.plot(OPT_ANGLE, 'b--', label='opt_angle')
                # plt.ylim(-2000, 0)
                plt.xlabel('Episode')
                plt.ylabel('Morphing Angle')
                plt.savefig(os.path.join('fig', 'Morphing Angle_' + time_str))
                plt.legend()
                update_plot.clear()
                # px
            plt.draw()
            plt.pause(0.1)
        plt.ioff()
        plt.close()
```

This is the drawing code I used. I don't see anything strange in it. Could you please take a look?

I think the problem lies in this code:
`env.seed(RANDOMSEED)`, `np.random.seed(RANDOMSEED)`, `torch.manual_seed(RANDOMSEED)`
Because a fixed random seed is used, the random numbers generated are the same every time, so the action selected at each step is the same. I don't know if I am right?
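A quick sketch of that effect (toy distribution, not the repo's policy): with a fixed torch.manual_seed, the sampled values come out identical on every run, so in a deterministic environment the whole trajectory, and hence the curve, can repeat exactly.

```python
import torch
from torch.distributions import Normal

for run in range(2):
    torch.manual_seed(1)  # same seed at the start of every run
    dist = Normal(torch.zeros(1), torch.ones(1))
    print([dist.sample().item() for _ in range(3)])  # identical list both runs
```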

If it's as you say, you can easily verify it by using different random seeds.
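For example, one way to check (the `--seed` argument here is just an illustration, not part of this repo) is to seed each run differently from the command line:

```python
import argparse
import numpy as np
import torch

parser = argparse.ArgumentParser()
parser.add_argument('--seed', type=int, default=0)  # pass a different value per run
args = parser.parse_args()

np.random.seed(args.seed)
torch.manual_seed(args.seed)
# env.seed(args.seed)  # if the environment supports seeding
# ...then call train(); different seeds should now give different curves
```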

OK, I will try it. Thanks for your help!