
Proximal Policy Optimization (PPO)


This repository provides a clean, working implementation of the PPO algorithm using JAX and Haiku, developed for the Deep Reinforcement Learning course at MVA. To see it in action, you can simply open the Colab notebook (see "How to run it" below)!

Training curves for our PPO agent on inverted pendulum, reacher, and pendulum show the average episodic return over the rollouts. More is available by running:

  • tensorboard --logdir results/inv-pend/
  • tensorboard --logdir results/reacher/
  • tensorboard --logdir results/pendulum/

Interested readers can have a look at our report, which goes deeper into the details.

Contents


Environments

Agents

Tricks

Networks

  • Separate value and policy networks.
  • The standard deviation of the policy can either be predicted by the policy network or fixed to a given value; a softplus activation keeps the standard deviation positive.
  • Orthogonal initialization of the weights and constant initialization of the biases.
  • Activation functions are tanh (a minimal sketch of these choices is given after this list).
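
As an illustration, the following Haiku sketch implements these choices. The layer sizes (64, 64) and all function names are our own assumptions, not necessarily what this repository uses.

```python
# A minimal sketch of the network choices above, using Haiku and JAX.
# Layer sizes and names are illustrative assumptions.
import haiku as hk
import jax
import jax.numpy as jnp

action_dim = 1  # e.g. a 1-D continuous action space

def policy_fn(obs: jnp.ndarray):
    """Gaussian policy: tanh MLP, orthogonal weights, constant biases."""
    w_init = hk.initializers.Orthogonal(scale=jnp.sqrt(2.0))
    b_init = hk.initializers.Constant(0.0)
    torso = hk.nets.MLP([64, 64], w_init=w_init, b_init=b_init,
                        activation=jnp.tanh, activate_final=True)
    h = torso(obs)
    mean = hk.Linear(action_dim, w_init=w_init, b_init=b_init)(h)
    # A softplus activation keeps the predicted standard deviation positive.
    std = jax.nn.softplus(hk.Linear(action_dim, w_init=w_init, b_init=b_init)(h))
    return mean, std

def value_fn(obs: jnp.ndarray):
    """Separate value network with the same torso structure."""
    w_init = hk.initializers.Orthogonal(scale=jnp.sqrt(2.0))
    b_init = hk.initializers.Constant(0.0)
    torso = hk.nets.MLP([64, 64], w_init=w_init, b_init=b_init,
                        activation=jnp.tanh, activate_final=True)
    return hk.Linear(1, w_init=w_init, b_init=b_init)(torso(obs))

policy = hk.without_apply_rng(hk.transform(policy_fn))
value = hk.without_apply_rng(hk.transform(value_fn))
```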

Training

  • Linear annealing of the learning rates, with different learning rates for the policy and value networks.
  • Learning with minibatches, with advantages normalized at the minibatch level (see the sketch below).
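
As an example, this setup could be expressed with optax roughly as follows. The learning rates and number of updates are placeholder values, not necessarily those used in the repository.

```python
# A rough sketch of the optimizer setup described above, using optax.
# Learning rates and the number of updates are placeholder assumptions.
import optax

num_updates = 1_000  # total number of gradient steps (assumed)

# Linearly annealed, separate learning rates for the policy and value networks.
policy_schedule = optax.linear_schedule(init_value=3e-4, end_value=0.0,
                                        transition_steps=num_updates)
value_schedule = optax.linear_schedule(init_value=1e-3, end_value=0.0,
                                       transition_steps=num_updates)
policy_optimizer = optax.adam(learning_rate=policy_schedule)
value_optimizer = optax.adam(learning_rate=value_schedule)

# Advantages are normalized per minibatch, e.g.:
# advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
```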

Loss

  • Using Generalized Advantage Estimation (GAE).
  • Clipped probability ratio.
  • Minimum between ratio × advantage and clipped ratio × advantage (see the sketch after this list).
  • Clipped gradient norm.
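
Concretely, the policy loss is the standard PPO clipped surrogate. The following JAX sketch illustrates it, together with a simple GAE computation; the clipping coefficient, gamma, and lambda values are assumptions, not necessarily the ones used here.

```python
# A minimal sketch of GAE and the clipped surrogate loss described above.
# epsilon, gamma, and lam are assumed hyperparameter values.
import jax.numpy as jnp
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation, computed backwards over a rollout."""
    advantages = np.zeros_like(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value * (1.0 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1.0 - dones[t]) * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

def ppo_policy_loss(log_probs, old_log_probs, advantages, epsilon=0.2):
    """Negative clipped surrogate: mean of min(ratio * A, clip(ratio) * A)."""
    ratio = jnp.exp(log_probs - old_log_probs)
    clipped_ratio = jnp.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    surrogate = jnp.minimum(ratio * advantages, clipped_ratio * advantages)
    return -jnp.mean(surrogate)

# Gradient norm clipping can be added through the optimizer, e.g.
# optax.chain(optax.clip_by_global_norm(0.5), optax.adam(learning_rate)).
```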

Environment wrappers

  • Normalization and clipping of the observations.
  • Normalization and clipping of the rewards.
  • Action normalization: the agent predicts actions in $[-1, 1]$, and the wrapper scales them back to the environment's action range (a sketch is given after this list).
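
For instance, the action rescaling could be written as a small Gym wrapper along the following lines. This is an illustration only, not necessarily the wrapper implemented in this repository.

```python
# A sketch of an action-rescaling wrapper in the spirit of the one described
# above, using the classic Gym API.
import gym
import numpy as np

class RescaleActionWrapper(gym.ActionWrapper):
    """Maps agent actions in [-1, 1] back to the environment's action range."""

    def __init__(self, env):
        super().__init__(env)
        self.low = env.action_space.low
        self.high = env.action_space.high
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0,
                                           shape=env.action_space.shape,
                                           dtype=np.float32)

    def action(self, action):
        # Linearly map [-1, 1] to [low, high] and clip for safety.
        action = self.low + 0.5 * (action + 1.0) * (self.high - self.low)
        return np.clip(action, self.low, self.high)
```

Recent Gym versions also ship similar built-in wrappers (e.g. gym.wrappers.RescaleAction).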

How to run it

The training loop is implemented in the ppo notebook, which contains instances of the agents tuned for each of the environments. Training metrics (losses, actions, rewards, etc.) are logged to a TensorBoard file, which you can monitor separately or from within the notebook. After training completes, a video of the agent is generated.
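
For example, to monitor the logs from within the notebook you can use the TensorBoard notebook magics (the results/ log directory is assumed from the examples above):

```python
# In a notebook cell: load the TensorBoard extension and point it at the logs.
%load_ext tensorboard
%tensorboard --logdir results/
```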

Fast and easy

Simply open the notebook in Colab via the Colab link and run it there.

Run it locally

First, clone the repository using the following command:

git clone git@github.com:emasquil/ppo.git

Then we recommend using a virtual environment, which you can create and activate as follows:

python3 -m venv env
source env/bin/activate

Finally, in order to install the package, you can simply run:

pip install -e .

If you are planning on developing the package, you will need to add [dev] at the end:

pip install -e .[dev]

This package uses MuJoCo environments; please install MuJoCo by following these instructions.

Note that you might need to install the following system packages:

sudo apt-get install -y xvfb ffmpeg freeglut3-dev libosmesa6-dev patchelf libglew-dev

After these installation steps you should be ready to run the notebook locally.

Results

In the results directory you can find some plots, logs, and videos of the agents after being trained on the environments previously mentioned.

Contributing

Before any pull request, please make sure to format your code using the following:

black -l 120 ./

Inspirations

  • vwxyzjn/cleanrl
  • openai/baselines
  • DLR-RM/stable-baselines3
  • openai/spinningup
  • Costa Huang's blogpost
  • deepmind/acme