policy network
Closed this issue · 6 comments
Hi,
In your code, the policy is sampled from a normal distribution whose parameters mu and sigma are estimated by the NN. However, if the action is a discrete integer, is that reasonable? The normal distribution outputs float numbers. In that case, how do you calculate the KL divergence?
Thx
@huiwenzhang
Hi, thank you for checking my code. This implementation targets a continuous action space. If you want to use PPO for a discrete action space, just change the policy network to a softmax output; you can then compute the variance and KL divergence from that softmax (categorical) distribution.
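For example, the discrete policy head could look roughly like this (a minimal sketch, not this repository's code; hidden and n_actions are just illustrative placeholders for the last hidden layer and the action count):

import tensorflow as tf

# hypothetical last hidden layer and action count, for illustration only
hidden = tf.placeholder(tf.float32, [None, 64])
n_actions = 4

logits = tf.layers.dense(hidden, n_actions)   # softmax policy head
probs = tf.nn.softmax(logits)                 # action probabilities
# sample an integer action instead of drawing from Normal(mu, sigma)
action = tf.squeeze(tf.multinomial(logits, num_samples=1), axis=1)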
@takuseno Yeah, I do use a softmax output. However, TensorFlow doesn't seem to have a distribution class tied to a softmax output. Anyway, I think we can compute the KL divergence for a discrete distribution manually, straight from the definition of KL.
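Something like this, straight from the definition (just a sketch, not code from the repo; old_probs and new_probs stand for the [batch, n_actions] softmax outputs of the old and current policies):

import tensorflow as tf

def categorical_kl(old_probs, new_probs, eps=1e-8):
    # KL(p_old || p_new) = sum_a p_old(a) * (log p_old(a) - log p_new(a)),
    # summed over actions and averaged over the batch
    log_ratio = tf.log(old_probs + eps) - tf.log(new_probs + eps)
    return tf.reduce_mean(tf.reduce_sum(old_probs * log_ratio, axis=1))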
BTW, the piece of code below (defined in agent.py) is a little bit weird:
def get_training_data(self):
    obss = list(self.obss)
    actions = list(self.actions)
    deltas = []
    returns = []
    V = 0
    for i in reversed(range(len(self.obss))):
        reward = self.rewards[i]
        value = self.values[i]
        next_value = self.next_values[i]
        delta = reward + self.gamma * next_value - value
        V = delta + self.lam * self.gamma * V
        deltas.append(V)
        returns.append(V + value)
    deltas = np.array(list(reversed(deltas)), dtype=np.float32)
    returns = np.array(list(reversed(returns)), dtype=np.float32)
    # standardize advantages
    deltas = (deltas - deltas.mean()) / (deltas.std() + 1e-5)
    self._reset_trajectories()
    return obss, actions, list(returns), list(deltas)
Generally, this function retrieves the advantages and estimated returns along an episode. These two quantities go into the policy loss and the value loss, which are critical for optimizing the NN. However, the computation here doesn't seem to match the formula reported in the paper. Can you explain this?
@huiwenzhang
You're right, we can manually compute the KL divergence from the softmax distribution. Actually, there is another option: use tf.distributions.Categorical for the discrete action space. See more detail here.
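For example, something along these lines (just a sketch; old_logits, new_logits and actions are illustrative placeholders, not names from this repository):

import tensorflow as tf

old_logits = tf.placeholder(tf.float32, [None, 4])  # old policy logits (hypothetical)
new_logits = tf.placeholder(tf.float32, [None, 4])  # current policy logits (hypothetical)
actions = tf.placeholder(tf.int32, [None])          # taken discrete actions

old_dist = tf.distributions.Categorical(logits=old_logits)
new_dist = tf.distributions.Categorical(logits=new_logits)

log_prob = new_dist.log_prob(actions)   # for the PPO probability ratio
kl = tf.reduce_mean(tf.distributions.kl_divergence(old_dist, new_dist))
entropy = tf.reduce_mean(new_dist.entropy())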
Then let me explain this implementation. The calculation is based on OpenAI Baselines' implementation. In the paper, they use Generalized Advantage Estimation (GAE) to estimate the advantages. delta is the ordinary one-step advantage used in A3C,

delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)

and V is the GAE version described in the paper,

A_t = delta_t + (gamma * lambda) * delta_{t+1} + (gamma * lambda)^2 * delta_{t+2} + ...

To update the value function, we use V + value. On the other hand, to update the policy, we use V. However, I admit the variable names don't describe themselves well, so I'll rename them soon.
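Written as a standalone function, the same backward recursion looks roughly like this (a sketch with my own names, not the code in agent.py):

import numpy as np

def compute_gae(rewards, values, next_values, gamma=0.99, lam=0.95):
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)   (one-step TD error)
    # A_t     = delta_t + gamma * lam * A_{t+1}     (GAE, accumulated backwards)
    advantages = np.zeros(len(rewards), dtype=np.float32)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_values[t] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    # value targets are advantage + V(s_t), matching "V + value" above
    returns = advantages + np.asarray(values, dtype=np.float32)
    return advantages, returns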
Does my answer solve the problem?
@takuseno Thank you for your explanation. I checked the paper against your notes and it's definitely correct. I have another question I'd like your thoughts on. PPO is an on-policy algorithm: it updates the policy with the most recently collected experience. What if we train on data with a longer horizon, i.e., samples that may come from the very beginning of training? Intuitively I think that's bad, because the GAE estimates are based on the state values recorded when those states were encountered, which might be very different from the current state values. However, when I tried some basic policy gradient methods and sampled only from recent experience, the performance was bad. In addition, I remember John also saying we shouldn't use the recent experience because it makes training unstable. What's your opinion?
@joschu I've also added John to this thread. Hope he can respond to us.
@huiwenzhang I guess you are talking about online algorithms like Q-learning. In deep reinforcement learning, a purely online update has the critical issue of overfitting to recent experiences, which is why DQN introduces the experience replay technique. The PPO paper assumes concurrent execution, like A2C, for the Atari environments. Here we use simple continuous control tasks by default, which PPO solves easily. However, if we test the PPO algorithm on Atari, we should use multiple concurrent agents just like in the paper.
Here are the hyperparameters from the paper.
@takuseno I think most RL algorithms are online, and so is PPO. The concurrent version of PPO, namely DPPO, is also online from the point of view of each PPO actor. I am just curious about the capacity of the training buffer in PPO. Is it equal to the horizon T listed above? If yes, does it mean most gradient-based methods update in a similar way? Namely: collect a batch of data whose size is the horizon T (or N*T with multiple actors), sample minibatches to estimate the gradient, advantage, value function, etc., update the weights based on those estimates to get a new policy, then clear the buffer and collect data generated by the current policy. Loop until the step count reaches the threshold.
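In pseudocode, the loop I have in mind is roughly this (a sketch; env, agent.act, agent.store, agent.update and the default numbers are hypothetical, and only get_training_data corresponds to the method quoted above):

import numpy as np

def minibatches(obss, actions, returns, advantages, batch_size):
    # shuffle once, then yield fixed-size minibatches
    indices = np.random.permutation(len(obss))
    for start in range(0, len(indices), batch_size):
        idx = indices[start:start + batch_size]
        yield ([obss[i] for i in idx], [actions[i] for i in idx],
               [returns[i] for i in idx], [advantages[i] for i in idx])

def train(env, agent, horizon=2048, epochs=10, batch_size=64, max_steps=1000000):
    obs = env.reset()
    step = 0
    while step < max_steps:
        # 1. collect exactly `horizon` transitions with the current policy
        for _ in range(horizon):
            action, value = agent.act(obs)
            next_obs, reward, done, _ = env.step(action)
            agent.store(obs, action, reward, value, done)
            obs = env.reset() if done else next_obs
            step += 1
        # 2. compute advantages/returns, then run a few epochs of minibatch updates
        obss, actions, returns, advantages = agent.get_training_data()
        for _ in range(epochs):
            for batch in minibatches(obss, actions, returns, advantages, batch_size):
                agent.update(*batch)
        # 3. the buffer is cleared inside get_training_data(); repeat with the new policy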