By Noah van Grinsven, Anton Steenvoorden, Tessa Wagenaar, Laurens Weitkamp.
Note: much better viewed in a markdown reader that can properly render LaTeX.
Nowadays many of the RL policy gradient methods use Generalized Advantage Estimation (GAE) as a baseline in actor-critic methods. Schulman et al.1 state that GAE reduces the variance of the policy gradient estimate when compared to other baselines, like plain advantage estimation. This comes at a cost, because GAE can introduce a bias into this estimate. To check this we will focus on $n$-step bootstrapping in actor-critic methods, which traditionally exhibits high variance for higher values of $n$. This raises the question:
What is the effect of (generalized) advantage estimation on the return in $n$-step bootstrapping?
This blogpost is meant to answer that question. Since GAE reduces the variance, we expect it to improve performance on high-variance problems, so in particular for higher values of $n$.
First, we will give a quick overview of actor-critic methods and (generalized) advantage estimation; we then describe our experiments on CartPole and discuss the results.
In reinforcement learning we typically want to maximize the total expected reward, which can be done using various methods. We can for example choose to learn the value function(s) for each state and infer the policy from this, or learn the policy directly through parameterization of the policy. Actor-critic methods combine the two approaches: the actor is a parameterized policy whose output matches the number of actions (a probability distribution over actions), while the critic estimates the state-value function, which is used to judge the actor's choices in the policy gradient estimate.
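Concretely, the actor is updated with the standard policy gradient in which the critic supplies the baseline through an advantage term:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right],$$

where $\hat{A}_t$ estimates how much better taking action $a_t$ was than the critic expected.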
This estimate can be biased (due to estimation and bootstrapping) and can exhibit high variance (a common problem in policy gradient based methods).
Monte Carlo based methods (such as actor critic) have one big disadvantage: we have to wait for the end of an episode to perform a backup. We can tackle this disadvantage by performing $n$-step bootstrapping: after collecting $n$ steps we use the critic's value estimate of the last state reached to stand in for the remainder of the return.
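In symbols, the $n$-step bootstrapped return and the corresponding advantage estimate are

$$\hat{R}_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n}), \qquad \hat{A}_t = \hat{R}_t^{(n)} - V(s_t).$$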
In code, estimating the advantage function through bootstrapping looks like this:
```python
import torch

def A(Rt, rewards, values, gamma):
    # Rt is initialised with the critic's bootstrapped estimate v(s_{t+n})
    # of the state following the last collected step
    returns = []
    for step in reversed(range(len(rewards))):
        # accumulate the discounted return, working backwards through the rollout
        Rt = rewards[step] + gamma * Rt
        returns.insert(0, Rt)
    # advantage = n-step return minus the critic's value estimate v(s_t)
    advantage = torch.tensor(returns) - values
    return advantage
```
Where the rewards and values are vectors of size $n$, the number of steps collected before bootstrapping.
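As a small usage sketch of the A function above (the numbers are purely illustrative and not taken from our experiments), assuming a rollout of $n = 5$ steps with the value estimates stored as a tensor:

```python
import torch

# illustrative rollout of n = 5 steps
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]               # one reward per step
values = torch.tensor([0.9, 1.2, 1.1, 1.0, 0.8])  # critic estimates v(s_t), ..., v(s_{t+4})
bootstrap = 0.7                                   # critic estimate of the state after the last step

advantages = A(bootstrap, rewards, values, gamma=0.99)  # one advantage per collected step
```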
We now turn to an idea proposed in the paper High Dimensional Continuous Control Using Generalized Advantage Estimation1 by Schulman et al. 2016. The advantage function reduces variance, but the authors claim we can use a better estimator: an exponentially weighted average of $k$-step advantage estimates, controlled by an additional parameter $\lambda$ that trades off bias and variance.
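Following the paper, GAE is built from the one-step TD residuals:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}.$$

Setting $\lambda = 0$ recovers the one-step TD advantage (lower variance, more bias), while $\lambda = 1$ recovers the Monte Carlo style advantage (higher variance, less bias). In code, this recursion looks like this: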
```python
def GAE(next_value, rewards, values, gamma, GAE_lambda):
    # append the bootstrapped value of the state after the last collected step
    values = values + [next_value]
    gae = 0
    returns = []
    for step in reversed(range(len(rewards))):
        # TD residual delta_t = r_t + gamma * v(s_{t+1}) - v(s_t)
        Qsa = rewards[step] + gamma * values[step + 1]
        Vs = values[step]
        delta = Qsa - Vs
        # exponentially weighted sum of TD residuals, built up backwards
        gae = delta + gamma * GAE_lambda * gae
        returns.insert(0, gae)
    return returns
```
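As a quick illustrative check (the values below are made up for the example), setting GAE_lambda to 0 reduces each entry to the one-step TD residual $r_t + \gamma V(s_{t+1}) - V(s_t)$:

```python
# illustrative values only
rewards = [1.0, 1.0, 1.0]
values = [0.9, 1.1, 1.0]   # v(s_t), v(s_{t+1}), v(s_{t+2})
next_value = 0.8           # bootstrapped value of the state after the last step

adv = GAE(next_value, rewards, values, gamma=0.99, GAE_lambda=0.0)
# each entry now equals r_t + gamma * v(s_{t+1}) - v(s_t)
```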
To answer our research question "What is the effect of (generalized) advantage estimation on the return in $n$-step bootstrapping?", we will vary the learning rate over different values of $n$, for both regular Advantage Estimation and GAE.
For our experiment we have chosen to use the CartPole-v0 environment of the OpenAI gym python package5.
The CartPole-v0 environment has two actions, namely push left and push right. The goal is to balance a pole on top of a cart (hence CartPole) for as long as possible (a maximum of 200 time steps), and the input our agent receives is a vector of four values: cart position, cart velocity, pole angle and pole velocity at tip. A video of the environment with a random policy can be seen below on the left hand side.
Video was taken from OpenAI Gym
This environment was chosen for its simplicity, while still having a continuous (and therefore quite large) state space. Additionally, CartPole-v0 was used in the original paper by Schulman et al. 20161, although a different update method was used there. More difficult environments have not been tested due to the limited time available for this project.
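For reference, this is roughly how the environment is created and inspected, assuming the gym API of that era (before version 0.26):

```python
import gym

env = gym.make("CartPole-v0")
print(env.observation_space)  # 4-dimensional continuous observation (Box)
print(env.action_space)       # 2 discrete actions: push left / push right

obs = env.reset()             # older gym versions return only the observation here
```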
We train the agent using a deep neural network where the input is transformed into shared features (a hidden feature vector, 32-dimensional by default in the code below), which are then fed into two separate heads: a policy head (the actor) and a value head (the critic). In PyTorch, the network looks like this:
```python
import torch.nn as nn

class ActorCriticMLP(nn.Module):
    def __init__(self, input_dim, n_acts, n_hidden=32):
        super(ActorCriticMLP, self).__init__()
        self.input_dim = input_dim
        self.n_acts = n_acts
        self.n_hidden = n_hidden
        # shared feature extractor
        self.features = nn.Sequential(
            nn.Linear(self.input_dim, self.n_hidden),
            nn.ReLU()
        )
        # critic head: scalar state-value estimate
        self.value_function = nn.Sequential(
            nn.Linear(self.n_hidden, 1)
        )
        # actor head: probability distribution over the actions
        self.policy = nn.Sequential(
            nn.Linear(self.n_hidden, n_acts),
            nn.Softmax(dim=0)
        )

    def forward(self, obs):
        obs = obs.float()
        obs = self.features(obs)
        probs = self.policy(obs)
        value = self.value_function(obs)
        return probs, value
```
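The full update step is not shown in this post; as a rough sketch of how the pieces could fit together (assuming a standard A2C-style loss in the spirit of the PyTorch actor-critic example referenced in the footnotes, with value_coef as an illustrative weighting parameter):

```python
import torch

def actor_critic_loss(log_probs, values, returns, advantages, value_coef=0.5):
    # log_probs:  log pi(a_t | s_t) for the actions actually taken
    # values:     critic estimates v(s_t) for the visited states
    # returns:    bootstrapped n-step return targets for the critic
    # advantages: AE or GAE estimates, detached so they act as fixed weights
    policy_loss = -(log_probs * advantages.detach()).mean()
    value_loss = (returns - values).pow(2).mean()
    return policy_loss + value_coef * value_loss
```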
In our experiment we performed a grid search over the learning rate and the n-step return.
For $\gamma$ we took a single fixed value across all runs; only the learning rate and $n$ were varied in the grid search.
The experiment first searches for the optimal learning rate for each value of $n$, separately for regular Advantage Estimation and GAE.
As learning rates for the regular Advantage Estimation we use values roughly an order of magnitude smaller than for GAE (see Table 1).
GAE can use a larger learning rate because it reduces the variance of the gradient estimate more than plain Advantage Estimation, which makes it able to take larger update steps without destabilizing training.
The search for these optimal parameters is cubic, as we iterate over three separate parameters: the learning rate, the value of $n$, and the random seed.
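A sketch of the resulting search loop (the learning-rate and $n$ grids below are purely illustrative placeholders, and train_agent is a hypothetical helper standing in for our training routine):

```python
# illustrative grids; not the exact values used in the experiment
learning_rates = [0.001, 0.005, 0.01, 0.05]
n_steps = [5, 20, 50, 100, 150]
seeds = [5, 6, 7, 8, 9]

results = {}
for lr in learning_rates:          # parameter 1
    for n in n_steps:              # parameter 2
        for seed in seeds:         # parameter 3
            # train_agent is a hypothetical wrapper around one full training run
            results[(lr, n, seed)] = train_agent(lr=lr, n=n, seed=seed)
```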
To ensure reproducibility, we have manually set the seeds for PyTorch and the gym environments.
We use in total 5 seeds, namely 5 through 9, generated from the number of environments we run:

```python
num_envs = 5
seeds = [i + num_envs for i in range(num_envs)]  # [5, 6, 7, 8, 9]
```
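Seeding itself is straightforward with the gym API of that time (env.seed was removed in gym 0.26, so this assumes an older version; make_seeded_env is just an illustrative helper):

```python
import torch
import gym

def make_seeded_env(seed):
    # fix the PyTorch RNG and the environment's RNG for reproducibility
    torch.manual_seed(seed)
    env = gym.make("CartPole-v0")
    env.seed(seed)
    return env
```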
To determine which setup works best, we first combine the results of all the seeds, grouped by return type (AE or GAE), $n$, and learning rate, and then pick the learning rate with the highest average return for each $n$. The results are summarized in Table 1.
| Return type | $n$-step → | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| Generalized Advantage Estimation | 0.03 | 0.01 | 0.01 | 0.01 | 0.03 | 0.01, 0.05 | 0.01, 0.03, 0.07, 0.09 | 0.03, 0.07 | 0.03 |
| Advantage Estimation | 0.001, 0.003 | 0.001 | 0.01 | 0.009 | 0.005 | 0.007 | 0.007 | 0.005 | 0.009 |
Table 1: Optimal learning rate per $n$-step; each column corresponds to one of the nine tested values of $n$. The value in each cell is the learning rate that yielded the greatest average return. When a cell contains multiple values, the corresponding average returns were within roughly 1.0 of each other. An example for GAE, $n=150$: 0.03 yields a return of $180.7$ whereas 0.05 yields a return of $181.8$.
For Generalized Advantage Estimation we see that the optimal learning rate lies around 0.01 and 0.03 for most values of $n$, and that for the higher values of $n$ several learning rates perform almost equally well. For regular Advantage Estimation the best learning rate varies more from one $n$ to the next, without a clear pattern.
Figure 1: Results on the CartPole-v0 environment, shown for the best learning rate of the GAE and AE returns. The curves show the mean with a band of one standard deviation around it. The x-axis label "Number of steps (in thousands)" refers to steps taken in the environment itself, and still needs to be multiplied by the number of agents. The y-axis shows the return, averaged over the seeds: at every 1000th step the weights are frozen and the agent is evaluated on 10 different episodes.
The learning curves have been plotted in Figure 1, which shows that GAE does not work for low values of $n$.
For low values of $n$ GAE fails to learn, while for higher values of $n$ it matches and eventually outperforms regular Advantage Estimation; we return to this in the conclusion below.
What is the effect of (generalized) advantage estimation on the return in $n$-step bootstrapping?
The idea of using GAE as a baseline in actor-critic methods is that it reduces variance while introducing a tolerable amount of bias compared to using the more standard AE. As $n$ grows, the variance of the $n$-step return grows with it, so we expected GAE's variance reduction to matter most for higher values of $n$.
What we see is that GAE does indeed outperform AE for higher values of $n$, whereas for low values of $n$ it fails to learn.
Now it is important to keep in mind that the way these methods are tested here is quite limited. The returns shown are averaged over a total of 5 seeds, which could give misleading results, and the results could be specific to this environment. For further research we would suggest testing these methods on other environments to see whether our conclusion generalizes.
Footnotes
1. Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
2. https://github.com/pytorch/examples/blob/master/reinforcement_learning/actor_critic.py
3. Actually, bootstrapping is what defines actor-critic methods when contrasted to vanilla policy gradient methods.
4. For the sake of completeness, we briefly tested the model on a different environment, MountainCar-v0, but we did not manage to get it to converge for a selection of learning rates and $n$-steps in time for this project.