
Proximal Policy Optimization (PPO)


This repository provides a clean, working implementation of the PPO algorithm using JAX and Haiku, developed for the Deep Reinforcement Learning course at MVA. To see it in action, you can simply open the Colab notebook (see "How to run it" below)!

Training curves for our PPO agent on inverted pendulum, reacher, and pendulum show the average episodic return over the rollouts. More is available by running:

  • tensorboard --logdir results/inv-pend/
  • tensorboard --logdir results/reacher/
  • tensorboard --logdir results/pendulum/

Interested readers can have a look at our report, which goes deeper into the details.

Contents


Environments

Agents

Tricks

Networks

  • Separate value and policy networks.
  • The standard deviation of the policy can either be predicted by the policy network or fixed to a given value; a softplus activation keeps the standard deviation positive.
  • Orthogonal initialization of the weights and constant initialization of the biases.
  • Activation functions are tanh (a minimal sketch of these choices is given after this list).
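
As an illustration, the following Haiku sketch implements these choices. The layer sizes (64, 64) and all function names are our own assumptions, not necessarily what this repository uses.

```python
# A minimal sketch of the network choices above, using Haiku and JAX.
# Layer sizes and names are illustrative assumptions.
import haiku as hk
import jax
import jax.numpy as jnp

action_dim = 1  # e.g. a 1-D continuous action space

def policy_fn(obs: jnp.ndarray):
    """Gaussian policy: tanh MLP, orthogonal weights, constant biases."""
    w_init = hk.initializers.Orthogonal(scale=jnp.sqrt(2.0))
    b_init = hk.initializers.Constant(0.0)
    torso = hk.nets.MLP([64, 64], w_init=w_init, b_init=b_init,
                        activation=jnp.tanh, activate_final=True)
    h = torso(obs)
    mean = hk.Linear(action_dim, w_init=w_init, b_init=b_init)(h)
    # A softplus activation keeps the predicted standard deviation positive.
    std = jax.nn.softplus(hk.Linear(action_dim, w_init=w_init, b_init=b_init)(h))
    return mean, std

def value_fn(obs: jnp.ndarray):
    """Separate value network with the same torso structure."""
    w_init = hk.initializers.Orthogonal(scale=jnp.sqrt(2.0))
    b_init = hk.initializers.Constant(0.0)
    torso = hk.nets.MLP([64, 64], w_init=w_init, b_init=b_init,
                        activation=jnp.tanh, activate_final=True)
    return hk.Linear(1, w_init=w_init, b_init=b_init)(torso(obs))

policy = hk.without_apply_rng(hk.transform(policy_fn))
value = hk.without_apply_rng(hk.transform(value_fn))
```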

Training

  • Linear annealing of the learning rates, with different learning rates for the policy and value networks.
  • Learning with minibatches, with advantages normalized at the minibatch level (see the sketch below).
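
As an example, this setup could be expressed with optax roughly as follows. The learning rates and number of updates are placeholder values, not necessarily those used in the repository.

```python
# A rough sketch of the optimizer setup described above, using optax.
# Learning rates and the number of updates are placeholder assumptions.
import optax

num_updates = 1_000  # total number of gradient steps (assumed)

# Linearly annealed, separate learning rates for the policy and value networks.
policy_schedule = optax.linear_schedule(init_value=3e-4, end_value=0.0,
                                        transition_steps=num_updates)
value_schedule = optax.linear_schedule(init_value=1e-3, end_value=0.0,
                                       transition_steps=num_updates)
policy_optimizer = optax.adam(learning_rate=policy_schedule)
value_optimizer = optax.adam(learning_rate=value_schedule)

# Advantages are normalized per minibatch, e.g.:
# advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
```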

Loss

  • Using Generalized Advantage Estimation (GAE).
  • Clipped probability ratio.
  • Minimum between ratio × advantage and clipped ratio × advantage (see the sketch after this list).
  • Clipped gradient norm.
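
Concretely, the policy loss is the standard PPO clipped surrogate. The following JAX sketch illustrates it, together with a simple GAE computation; the clipping coefficient, gamma, and lambda values are assumptions, not necessarily the ones used here.

```python
# A minimal sketch of GAE and the clipped surrogate loss described above.
# epsilon, gamma, and lam are assumed hyperparameter values.
import jax.numpy as jnp
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation, computed backwards over a rollout."""
    advantages = np.zeros_like(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value * (1.0 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1.0 - dones[t]) * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

def ppo_policy_loss(log_probs, old_log_probs, advantages, epsilon=0.2):
    """Negative clipped surrogate: mean of min(ratio * A, clip(ratio) * A)."""
    ratio = jnp.exp(log_probs - old_log_probs)
    clipped_ratio = jnp.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    surrogate = jnp.minimum(ratio * advantages, clipped_ratio * advantages)
    return -jnp.mean(surrogate)

# Gradient norm clipping can be added through the optimizer, e.g.
# optax.chain(optax.clip_by_global_norm(0.5), optax.adam(learning_rate)).
```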

Environment wrappers

  • Normalization and clipping of the observations.
  • Normalization and clipping of the rewards.
  • Action normalization: the agent predicts actions in $[-1, 1]$, and the wrapper scales them back to the environment's action range (a sketch is given after this list).
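
For instance, the action rescaling could be written as a small Gym wrapper along the following lines. This is an illustration only, not necessarily the wrapper implemented in this repository.

```python
# A sketch of an action-rescaling wrapper in the spirit of the one described
# above, using the classic Gym API.
import gym
import numpy as np

class RescaleActionWrapper(gym.ActionWrapper):
    """Maps agent actions in [-1, 1] back to the environment's action range."""

    def __init__(self, env):
        super().__init__(env)
        self.low = env.action_space.low
        self.high = env.action_space.high
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0,
                                           shape=env.action_space.shape,
                                           dtype=np.float32)

    def action(self, action):
        # Linearly map [-1, 1] to [low, high] and clip for safety.
        action = self.low + 0.5 * (action + 1.0) * (self.high - self.low)
        return np.clip(action, self.low, self.high)
```

Recent Gym versions also ship similar built-in wrappers (e.g. gym.wrappers.RescaleAction).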

How to run it

The training loop is implemented in the ppo notebook, which contains instances of the agents tuned for each of the environments. Training metrics (losses, actions, rewards, etc.) are logged to a TensorBoard file, which you can monitor separately or from within the notebook. After training completes, a video of the agent is generated.
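
For example, to monitor the logs from within the notebook you can use the TensorBoard notebook magics (the results/ log directory is assumed from the examples above):

```python
# In a notebook cell: load the TensorBoard extension and point it at the logs.
%load_ext tensorboard
%tensorboard --logdir results/
```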

Fast and easy

Simply open the notebook in Colab via the Colab link and run it there.

Run it locally

First, clone the repository using the following command:

git clone git@github.com:emasquil/ppo.git

Then we recommend using a virtual environment, which you can create and activate as follows:

python3 -m venv env
source env/bin/activate

Finally, in order to install the package, you can simply run:

pip install -e .

If you are planning on developing the package, you will need to add [dev] at the end:

pip install -e .[dev]

This package uses MuJoCo environments; please install MuJoCo by following these instructions.

Note that you might need to install the following system packages:

sudo apt-get install -y xvfb ffmpeg freeglut3-dev libosmesa6-dev patchelf libglew-dev

After these installation steps you should be ready to run the notebook locally.

Results

In the results directory you can find some plots, logs, and videos of the agents after being trained on the environments previously mentioned.

Contributing

Before any pull request, please make sure to format your code using the following:

black -l 120 ./

Inspirations

  • vwxyzjn/cleanrl
  • openai/baselines
  • DLR-RM/stable-baselines3
  • openai/spinningup
  • Costa Huang's blogpost
  • deepmind/acme