This is the implementation of Actor with Variance Estimated Critic (AVEC). The code is a fork of Stable Baselines.
This repository supports TensorFlow versions 1.14.0 through 2.3.1 and runs on GPUs.
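As a quick sanity check (not part of the repository), you can confirm the installed TensorFlow version and GPU visibility before training:

import tensorflow as tf

# The supported range is 1.14.0 to 2.3.1; tf.test.is_gpu_available() reports
# whether TensorFlow can see a CUDA-capable GPU (deprecated but still present in TF 2.x).
print(tf.__version__)
print(tf.test.is_gpu_available())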
Baselines requires Python 3 (>= 3.5) with the development headers. You'll also need the system packages OpenMPI and zlib, which can be installed on Ubuntu as follows:
sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zlib1g-dev
Installation of system packages on Mac requires Homebrew. With Homebrew installed, run the following:
brew install cmake openmpi
To install the repository on Windows, please look at the documentation.
To use your RL algorithm with AVEC, simply set avec_coef=1. and vf_coef=0. (or value_coef=0. for SAC).
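For instance, assuming the fork exposes these keyword arguments on the other on-policy algorithms as well (A2C is shown here purely as an illustration), switching to the AVEC critic is a one-line change:

from stable_baselines import A2C

# avec_coef=1. activates the AVEC critic loss; vf_coef=0. disables the
# standard value-function loss (both assumed to be fork-specific kwargs on A2C).
model = A2C('MlpPolicy', 'CartPole-v1', avec_coef=1., vf_coef=0., verbose=1)
model.learn(total_timesteps=10000)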
Run python run.py to start training, monitor the learning logs (.csv, TensorBoard), and reproduce the results of the paper with the environments and seeds of your choice.
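As an example, the .csv monitor logs can be inspected with Stable Baselines' built-in plotting helpers; the log directory below is an assumption, adjust it to wherever run.py writes its output:

from stable_baselines.results_plotter import load_results, ts2xy

# Load the monitor .csv files and convert them to (timesteps, episode reward) arrays.
# './logs' is a placeholder path, not the repository's actual log directory.
results = load_results('./logs')
timesteps, rewards = ts2xy(results, 'timesteps')
print(timesteps[-5:], rewards[-5:])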
Here is a quick example of how to train and run AVEC-PPO2 on an AntBullet environment:
import gym
import pybullet_envs  # registers the AntBulletEnv-v0 environment
from stable_baselines.common.policies import MlpPolicy
from stable_baselines import PPO2

env = gym.make('AntBulletEnv-v0')
# avec_coef=1. enables the AVEC critic; vf_coef=0. disables the standard value-function loss.
model = PPO2(MlpPolicy, env, verbose=1, avec_coef=1., vf_coef=0.)
model.learn(total_timesteps=1000000)

# Run the trained policy.
obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
env.close()
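The trained model can then be saved and reloaded with the standard Stable Baselines API (the file name is arbitrary):

# Save to avec_ppo2_ant.zip and restore it later without retraining.
model.save("avec_ppo2_ant")
model = PPO2.load("avec_ppo2_ant")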
Or just train a model with a one-liner if the environment and the policy are registered in Gym:
import pybullet_envs  # registers the AntBulletEnv-v0 environment
from stable_baselines import PPO2
model = PPO2('MlpPolicy', 'AntBulletEnv-v0', avec_coef=1., vf_coef=0.).learn(1000000)
Here is a quick example of how to train and run AVEC-TRPO on an AntBullet environment:
import pybullet_envs  # registers the AntBulletEnv-v0 environment
from stable_baselines.trpo_mpi import TRPO
model = TRPO('MlpPolicy', 'AntBulletEnv-v0', avec_coef=1., vf_coef=0.).learn(1000000)
Finally, here is a quick example of how to train and run AVEC-SAC on an AntBullet environment:
import pybullet_envs  # registers the AntBulletEnv-v0 environment
from stable_baselines.sac import SAC
model = SAC('CustomSACPolicy', 'AntBulletEnv-v0', avec_coef=1., value_coef=0.).learn(1000000)
Please read the documentation for more examples.
Some of the examples use the MuJoCo (Multi-Joint dynamics with Contact) physics simulator, which is proprietary and requires binaries and a license (a temporary 30-day license can be obtained from www.mujoco.org). Instructions on setting up MuJoCo can be found here.