vwxyzjn/cleanrl

Poor Evaluation Performance in PPO

Opened this issue · 5 comments

Problem Description

I have been encountering poor evaluation performance when using the PPO model from HuggingFace.

https://huggingface.co/cleanrl/Hopper-v4-ppo_continuous_action-seed1

mean_reward on Hopper-v4: 3.83 +/- 5.28


When I inspected the TensorBoard curves provided in the HuggingFace repository, the rollout data appeared normal, which left me somewhat puzzled.

To determine whether this was just randomness, I ran experiments with three different random seeds; sure enough, the evaluation performance remained consistently poor across all of them. To probe further, I evaluated the PPO model over the full course of training, again with three different random seeds. Intriguingly, the evaluation performance started out normal but then degraded considerably over time.

This looks a lot like overfitting, although that seemed improbable: I tried to rule out data correlation by running a parallel experiment across four environments, yet the poor evaluation performance persisted.


I would appreciate any insights into this issue or possible suggestions towards resolving it.

My bad for not looking closely at #423. I think the issue is that the normalize wrappers have state that is not saved. See #310 (comment)
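
For concreteness, here is a minimal sketch of what saving that state could look like, assuming the environments are built the way ppo_continuous_action.py builds them (each sub-env wrapped with gym.wrappers.NormalizeObservation and gym.wrappers.NormalizeReward); the save_with_normalization_stats helper and the checkpoint layout are hypothetical:

import torch

def save_with_normalization_stats(agent, envs, path):
    # obs_rms / return_rms live on the NormalizeObservation / NormalizeReward
    # wrappers of each sub-env; gym's wrapper attribute delegation lets us
    # reach them from the outermost wrapper
    obs_rms = envs.envs[0].obs_rms
    return_rms = envs.envs[0].return_rms
    torch.save(
        {
            "model_state_dict": agent.state_dict(),
            "obs_rms": {"mean": obs_rms.mean, "var": obs_rms.var, "count": obs_rms.count},
            "return_rms": {"mean": return_rms.mean, "var": return_rms.var, "count": return_rms.count},
        },
        path,
    )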

Thank you so much for your response; this is indeed an intriguing issue. I have gone through #310, and it makes sense that we need to save obs_rms and return_rms.

However, pickling them and uploading them directly to HuggingFace might not be the most elegant solution, not to mention that we would need to modify the enjoy.py workflow.

Since these statistics are part of what the agent has learned, I am considering storing the two values on the agent object itself. That way they can be pickled along with the cleanrl_model file, which would make the restoration process simpler.

# imports and layer_init as in cleanrl's ppo_continuous_action.py
import numpy as np
import torch
import torch.nn as nn
from torch.distributions.normal import Normal


def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer


class Agent(nn.Module):
    def __init__(self, envs):
        super().__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0),
        )
        self.actor_mean = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, np.prod(envs.single_action_space.shape)), std=0.01),
        )
        self.actor_logstd = nn.Parameter(torch.zeros(1, np.prod(envs.single_action_space.shape)))
        
        # keep a reference to the normalization statistics of the first sub-env
        # (NormalizeObservation's obs_rms and NormalizeReward's return_rms) so
        # they get serialized together with the agent
        self.env_obs_rms = envs.envs[0].obs_rms
        self.env_return_rms = envs.envs[0].return_rms

    def get_value(self, x):
        return self.critic(x)

    def get_action_and_value(self, x, action=None):
        action_mean = self.actor_mean(x)
        action_logstd = self.actor_logstd.expand_as(action_mean)
        action_std = torch.exp(action_logstd)
        probs = Normal(action_mean, action_std)
        if action is None:
            action = probs.sample()
        return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(x)
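
As a rough illustration of how the embedded statistics could then be used at evaluation time, here is a hedged sketch of an evaluation loop that normalizes observations itself, using the same formula as gym.wrappers.NormalizeObservation, instead of relying on a freshly initialized wrapper whose statistics start from scratch; the evaluate_with_embedded_rms helper and its arguments are hypothetical and assume the gymnasium step API:

import numpy as np
import torch

def evaluate_with_embedded_rms(agent, env, num_episodes=10, device="cpu"):
    # env is a plain (non-normalized) single environment; observations are
    # normalized manually with the statistics stored on the agent
    returns = []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        done, episodic_return = False, 0.0
        while not done:
            norm_obs = (obs - agent.env_obs_rms.mean) / np.sqrt(agent.env_obs_rms.var + 1e-8)
            with torch.no_grad():
                action, _, _, _ = agent.get_action_and_value(
                    torch.tensor(norm_obs, dtype=torch.float32, device=device).unsqueeze(0)
                )
            obs, reward, terminated, truncated, _ = env.step(action.squeeze(0).cpu().numpy())
            done = terminated or truncated
            episodic_return += reward
        returns.append(episodic_return)
    return returns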

There is also the question of whether saving obs_rms and return_rms from only the first environment is sufficient when multiple parallel environments are used.
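
One possible answer, sketched below under the assumption that each sub-env keeps its own RunningMeanStd exposing mean, var, and count, is to merge the statistics of all parallel environments with the standard parallel-variance combination instead of keeping only the first env's values; combine_rms is a hypothetical helper:

import numpy as np

def combine_rms(rms_list):
    # merge a list of RunningMeanStd-like objects into a single (mean, var, count)
    mean, var, count = rms_list[0].mean, rms_list[0].var, rms_list[0].count
    for rms in rms_list[1:]:
        delta = rms.mean - mean
        tot = count + rms.count
        new_mean = mean + delta * rms.count / tot
        m2 = var * count + rms.var * rms.count + np.square(delta) * count * rms.count / tot
        mean, var, count = new_mean, m2 / tot, tot
    return mean, var, count

# e.g. merged_mean, merged_var, merged_count = combine_rms(
#     [e.obs_rms for e in envs.envs]
# )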

I also want to confirm whether we need to save the NormalizeReward wrapper's state. If we don't save it, a downloaded policy can only be used for inference and cannot continue training. Saving it, however, means handling both the NormalizeObservation and NormalizeReward wrappers in the code, which could make our single-file implementation overly long.

Yeah, so that's a bit unfortunate. I guess an alternative is to load the states into the normalize wrappers along with the saved model.
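
A minimal sketch of that alternative, assuming a checkpoint dict like the one sketched earlier and an evaluation script that builds the same wrapped environments; load_normalization_stats is a hypothetical helper:

import torch

def load_normalization_stats(envs, checkpoint_path):
    # copy the saved statistics back into each sub-env's NormalizeObservation
    # and NormalizeReward wrappers, so evaluation (and continued training)
    # sees the training-time normalization
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    for env in envs.envs:
        env.obs_rms.mean = ckpt["obs_rms"]["mean"]
        env.obs_rms.var = ckpt["obs_rms"]["var"]
        env.obs_rms.count = ckpt["obs_rms"]["count"]
        env.return_rms.mean = ckpt["return_rms"]["mean"]
        env.return_rms.var = ckpt["return_rms"]["var"]
        env.return_rms.count = ckpt["return_rms"]["count"]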